* RAID performance
@ 2013-02-07  6:48 Adam Goryachev
  2013-02-07  6:51 ` Adam Goryachev
                   ` (4 more replies)
  0 siblings, 5 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07  6:48 UTC (permalink / raw)
  To: linux-raid

Hi all,

I'm trying to resolve a significant performance issue (not arbitrary dd
tests, etc but real users complaining, real workload performance).

I'm currently using 5 x 480GB SSD's in a RAID5 as follows:
md1 : active raid5 sdf1[0] sdc1[4] sdb1[5] sdd1[3] sde1[1]
      1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
[UUUUU]
      bitmap: 4/4 pages [16KB], 65536KB chunk

Each drive only has a single partition, and is partitioned a little
smaller than the drive (supposedly this should improve performance).
Each drive is set to the deadline scheduler.

Drives are:
Intel 520s MLC 480G SATA3
Supposedly Read 550M/Write 520M

I think the workload being generated is simply too much for the
underlying drives. I've been collecting the information from
/sys/block/<drive>/stat every 10 seconds for each drive. What makes me
think the drives are overworked is that the backlog value gets very high
at the same time the users complain about performance.
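
For reference, the collector is roughly a loop like this (not the exact
script; the log path is just an example):

while sleep 10; do
    for d in sdb sdc sdd sde sdf; do
        # per Documentation/block/stat.txt: field 9 = IOs in flight,
        # field 10 = active time (ms), field 11 = weighted/backlog time (ms)
        echo "$(date +%s) $d $(cat /sys/block/$d/stat)"
    done
done >> /var/log/drive-stats.log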

The load is a bunch of Windows VMs, which were working fine until
recently, when I migrated the main fileserver/domain controller onto
this system (previously it ran from a single SCSI Ultra320 disk on a
standalone machine). Hence, this also seems to indicate a lack of
performance.

Currently the SSD's are connected to the onboard SATA ports (only SATA II):
00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI
Controller (rev 05)

There is one additional SSD (just the OS drive) also connected, but it
is mostly idle (all it does is log the stats, etc).

Assuming the issue is underlying hardware, then I'm thinking of doing
the following:
1) Get a battery-backed RAID controller card (which should improve
latency because the OS can treat data as written while the card deals
with writing it to disk).
2) Move from a 5 disk RAID5 to an 8 disk RAID10, giving better data
protection (can lose up to four drives) and hopefully better performance
(the main concern right now), with the same capacity as current.

The real questions are:
1) Is this data enough to say that the performance issue is due to
underlying hardware as opposed to a mis-configuration?
2) If so, any suggestions on specific hardware which would help?
3) Would removing the bitmap make an improvement to the performance?

Motherboard is Intel S1200BTLR Serverboard - 6xSATAII / Raid 0,1,10,5

It is possible to wipe the array and re-create it, if that would help.......

Any comments, suggestions or advice gratefully received.

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au



* Re: RAID performance
  2013-02-07  6:48 RAID performance Adam Goryachev
@ 2013-02-07  6:51 ` Adam Goryachev
  2013-02-07  8:24   ` Stan Hoeppner
  2013-02-07  7:02 ` Carsten Aulbert
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07  6:51 UTC (permalink / raw)
  To: linux-raid

Sorry, graphs can be seen at:

http://203.98.89.64/graphs/


On 07/02/13 17:48, Adam Goryachev wrote:
> Hi all,
>
> I'm trying to resolve a significant performance issue (not arbitrary dd
> tests, etc but real users complaining, real workload performance).
>
> I'm currently using 5 x 480GB SSD's in a RAID5 as follows:
> md1 : active raid5 sdf1[0] sdc1[4] sdb1[5] sdd1[3] sde1[1]
>       1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
> [UUUUU]
>       bitmap: 4/4 pages [16KB], 65536KB chunk
>
> Each drive only has a single partition, and is partitioned a little
> smaller than the drive (supposedly this should improve performance).
> Each drive is set to the deadline scheduler.
>
> Drives are:
> Intel 520s MLC 480G SATA3
> Supposedly Read 550M/Write 520M
>
> I think the workload being generated is simply too much for the
> underlying drives. I've been collecting the information from
> /sys/block/<drive>/stat every 10 seconds for each drive. What makes me
> think the drives are overworked is that the backlog value gets very high
> at the same time the users complain about performance.
>
> The load is a bunch of windows VM's, which were working fine until
> recently when I migrated the main fileserver/domain controller on
> (previously it was a single SCSI Ultra320 disk on a standalone machine).
> Hence, this also seems to indicate a lack of performance.
>
> Currently the SSD's are connected to the onboard SATA ports (only SATA II):
> 00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI
> Controller (rev 05)
>
> There is one additional SSD which is just the OS drive also connected,
> but it is mostly idle (all it does is log the stats/etc).
>
> Assuming the issue is underlying hardware, then I'm thinking to do the
> following:
> 1) Get a battery backed RAID controller card (which should improve
> latency because the OS can pretend it is written while the card deals
> with writing it to disk).
> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
> protection (can lose up to four drives) and hopefully better performance
> (main concern right now), and same capacity as current.
>
> The real questions are:
> 1) Is this data enough to say that the performance issue is due to
> underlying hardware as opposed to a mis-configuration?
> 2) If so, any suggestions on specific hardware which would help?
> 3) Would removing the bitmap make an improvement to the performance?
>
> Motherboard is Intel S1200BTLR Serverboard - 6xSATAII / Raid 0,1,10,5
>
> It is possibly to wipe the array and re-create that would help.......
>
> Any comments, suggestions, advice greatly received.
>
> Thanks,
> Adam
>


-- 
Adam Goryachev
Website Managers
Ph: +61 2 8304 0000                            adam@websitemanagers.com.au
Fax: +61 2 8304 0001                            www.websitemanagers.com.au



* Re: RAID performance
  2013-02-07  6:48 RAID performance Adam Goryachev
  2013-02-07  6:51 ` Adam Goryachev
@ 2013-02-07  7:02 ` Carsten Aulbert
  2013-02-07 10:12   ` Adam Goryachev
  2013-02-07  8:11 ` Stan Hoeppner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 131+ messages in thread
From: Carsten Aulbert @ 2013-02-07  7:02 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

Hi Adam

On 02/07/2013 07:48 AM, Adam Goryachev wrote:
> 
> Each drive only has a single partition, and is partitioned a little
> smaller than the drive (supposedly this should improve performance).
> Each drive is set to the deadline scheduler.

First, I'd start with the deadline scheduler.

Even for rotating-rust systems you can improve performance quite a bit,
but with SSDs you might hit the limits of the defaults way too early and
need to tune them.

Save the current default values somewhere safe, so you can go back
easily, then start tuning these values (as a start):

for i in $(grep \^md /proc/mdstat |cut -d' ' -f 5-); do
    DEV=${i:0:3}
    echo deadline > /sys/block/$DEV/queue/scheduler
    echo 4096  > /sys/block/$DEV/queue/nr_requests
    echo 8192  > /sys/block/$DEV/queue/read_ahead_kb
    echo 5000  > /sys/block/$DEV/queue/iosched/read_expire
    echo 1000 > /sys/block/$DEV/queue/iosched/write_expire
    echo 2048  > /sys/block/$DEV/queue/iosched/fifo_batch
done

At least setting these helped quite a bit.
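
To capture the current defaults before changing anything, something
along these lines will do (just a sketch - adjust the output path):

for i in $(grep \^md /proc/mdstat |cut -d' ' -f 5-); do
    DEV=${i:0:3}
    for f in queue/scheduler queue/nr_requests queue/read_ahead_kb \
             queue/iosched/read_expire queue/iosched/write_expire \
             queue/iosched/fifo_batch; do
        # record "device file value" so it can be echoed back later
        echo "$DEV $f $(cat /sys/block/$DEV/$f)"
    done
done > /root/blockdev-defaults.txt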

HTH

carsten


* Re: RAID performance
  2013-02-07  6:48 RAID performance Adam Goryachev
  2013-02-07  6:51 ` Adam Goryachev
  2013-02-07  7:02 ` Carsten Aulbert
@ 2013-02-07  8:11 ` Stan Hoeppner
  2013-02-07 10:05   ` Adam Goryachev
  2013-02-08  7:21   ` RAID performance Adam Goryachev
  2013-02-07  9:07 ` Dave Cundiff
  2013-02-07 11:32 ` Mikael Abrahamsson
  4 siblings, 2 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-07  8:11 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On 2/7/2013 12:48 AM, Adam Goryachev wrote:

> I'm trying to resolve a significant performance issue (not arbitrary dd
> tests, etc but real users complaining, real workload performance).

It's difficult to analyze your situation without even a basic
description of the workload(s).  What is the file access pattern?  What
types of files?

> I'm currently using 5 x 480GB SSD's in a RAID5 as follows:
> md1 : active raid5 sdf1[0] sdc1[4] sdb1[5] sdd1[3] sde1[1]
>       1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
...
> Each drive is set to the deadline scheduler.

Switching to noop may help a little, as may disabling NCQ, i.e. putting
the driver in native IDE mode, or setting the queue depth to 1.
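
Something like this, per member drive (sdb is just an example;
queue_depth=1 effectively disables NCQ):

echo noop > /sys/block/sdb/queue/scheduler
echo 1 > /sys/block/sdb/device/queue_depth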

> Drives are:
> Intel 520s MLC 480G SATA3
> Supposedly Read 550M/Write 520M

> I think the workload being generated is simply too much for the
> underlying drives. 

Not possible.  With an effective spindle width of 4, these SSDs can do
~80K random read/write IOPS sustained.  To put that into perspective,
you would need a ~$150,000 high end FC SAN array controller with 270 15K
SAS drives in RAID0 to get the same IOPS.
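
(Back of the envelope: 4 data spindles x ~20K sustained random IOPS each
is ~80K, and at roughly 300 IOPS per 15K SAS drive that is where the
~270 drive figure comes from.)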

The problem is not the SSDs.  Probably not the controller either.

> I've been collecting the information from
> /sys/block/<drive>/stat every 10 seconds for each drive. What makes me
> think the drives are overworked is that the backlog value gets very high
> at the same time the users complain about performance.

What is "very high"?  Since you mention "backlog" I'll assume you're
referring to field #11.  If so, note that on my idle server (field #9 is
0), it is currently showing 434045280 for field #11.  That's apparently
a weighted value of milliseconds.  And apparently it's not reliable as a
diagnostic value.

What you should be looking at is field #9, which simply tells you how
many IOs are in progress.  But even if this number is high, and it can
be very high with SSDs, it doesn't tell you whether the drive is
performing properly or not.  What you should be using is ioptop or
something similar.  But this still isn't going to be all that informative.
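
If you want a quick look at field #9 without any extra tooling, a
one-liner like this (sdb as an example) prints the in-flight count once
a second:

while sleep 1; do
    echo "$(date +%T) in_flight: $(awk '{print $9}' /sys/block/sdb/stat)"
done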

> The load is a bunch of windows VM's, which were working fine until
> recently when I migrated the main fileserver/domain controller on
> (previously it was a single SCSI Ultra320 disk on a standalone machine).
> Hence, this also seems to indicate a lack of performance.

You just typed 4 lines and told us nothing of how this relates to the
problem you wish us to help you solve.  Please be detailed.

> Currently the SSD's are connected to the onboard SATA ports (only SATA II):
> 00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI
> Controller (rev 05)

Unless this Southbridge has a bug (I don't have time to research it),
then this isn't the problem.

> There is one additional SSD which is just the OS drive also connected,
> but it is mostly idle (all it does is log the stats/etc).

Irrelevant.

> Assuming the issue is underlying hardware

It's not.

> 1) Get a battery backed RAID controller card (which should improve
> latency because the OS can pretend it is written while the card deals
> with writing it to disk).

[BB/FB]WC is basically useless with SSDs.  LSI has the best boards, and
the "FastPath" option for SSDs basically disables the onboard cache to
get it out of the way.  Enterprise SSDs have extra capacitance allowing
for cache flushing on power loss so battery/flash protection on the RAID
card isn't necessary.  The write cache on the SSDs themselves is faster
in aggregate than the RAID card's ASIC and cache RAM interface, thus
having BBWC on the card enabled with SSDs actually slows you down.

So, in short, this isn't the answer to your problem, either.

> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
> protection (can lose up to four drives) and hopefully better performance
> (main concern right now), and same capacity as current.

You've got plenty of hardware performance.  Moving to RAID10 will simply
cost more money with no performance gain.  Here's why:

md/RAID5 and md/RAID10 both rely on a single write thread.  If you've
been paying attention on this list you know that patches are in the
works to fix this but are not, AFAIK, in mainline yet, and a long way
from being in distro kernels.  So, you've got maximum possible read
performance now, but your *write performance is limited to a single CPU
core* with both of these RAID drivers.  If your problem is write
performance, your only solution at this time with md is to use a layered
RAID, such as RAID0 over RAID1 pairs, or linear over RAID1 pairs.  This
puts all of your cores in play for writes.

The reason this is an issue is that even a small number of SSDs can
overwhelm a single md thread, which is limited to one core of
throughput.  This has also been discussed thoroughly here recently.
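
As a rough sketch only (device names are placeholders, adjust to your
own drive set), the layered layout looks something like this:

# one RAID1 pair per md device, each pair gets its own write thread
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1
mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sdf1 /dev/sdg1
# then stripe (or use linear) across the pairs
mdadm --create /dev/md20 --level=0 --chunk=64 --raid-devices=3 \
    /dev/md10 /dev/md11 /dev/md12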

> The real questions are:
> 1) Is this data enough to say that the performance issue is due to
> underlying hardware as opposed to a mis-configuration?

No, it's not.  We really need to have more specific workload data.

> 2) If so, any suggestions on specific hardware which would help?

It's not a hardware problem.  Given that it's a VM consolidation host,
I'd guess it's a hypervisor configuration problem.

> 3) Would removing the bitmap make an improvement to the performance?

I can't say this any more emphatically.  You have 5 of Intel's best
consumer SSDs and an Intel mainboard.  The problem is not your hardware.

> Motherboard is Intel S1200BTLR Serverboard - 6xSATAII / Raid 0,1,10,5
> 
> It is possibly to wipe the array and re-create that would help.......

Unless you're write IOPS starved due to md/RAID5 as I described above,
blowing away the array and creating a new one isn't going to help.  You
simply need to investigate further.

And if you would like continued assistance, you'd need to provide much
greater detail of the hardware and workload.  You didn't mention your
CPU(s) model/freq.  This matters greatly with RAID5 and SSD.  Nor RAM
type/capacity, network topology, nor number of users and what
applications they're running when they report the performance problem.
Nor did you mention which hypervisor kernel/distro you're using, how
many Windows VMs you're running, and the primary workload of each, etc,
etc, etc.

> Any comments, suggestions, advice greatly received.

More information, please.

-- 
Stan



* Re: RAID performance
  2013-02-07  6:51 ` Adam Goryachev
@ 2013-02-07  8:24   ` Stan Hoeppner
  0 siblings, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-07  8:24 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On 2/7/2013 12:51 AM, Adam Goryachev wrote:
> Sorry, graphs can be seen at:
> 
> http://203.98.89.64/graphs/

The graphs tell us nothing in isolation Adam.  What is needed is to know
what workloads are running when the device queues and response times
start piling up.  Whether rust or SSD, queues will increase, as well as
IO completion times, when sufficient IO is being performed.

-- 
Stan



* Re: RAID performance
  2013-02-07  6:48 RAID performance Adam Goryachev
                   ` (2 preceding siblings ...)
  2013-02-07  8:11 ` Stan Hoeppner
@ 2013-02-07  9:07 ` Dave Cundiff
  2013-02-07 10:19   ` Adam Goryachev
  2013-02-11 19:49   ` Roy Sigurd Karlsbakk
  2013-02-07 11:32 ` Mikael Abrahamsson
  4 siblings, 2 replies; 131+ messages in thread
From: Dave Cundiff @ 2013-02-07  9:07 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On Thu, Feb 7, 2013 at 1:48 AM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> Hi all,
>
> I'm trying to resolve a significant performance issue (not arbitrary dd
> tests, etc but real users complaining, real workload performance).
>
> I'm currently using 5 x 480GB SSD's in a RAID5 as follows:
> md1 : active raid5 sdf1[0] sdc1[4] sdb1[5] sdd1[3] sde1[1]
>       1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
> [UUUUU]
>       bitmap: 4/4 pages [16KB], 65536KB chunk
>
> Each drive only has a single partition, and is partitioned a little
> smaller than the drive (supposedly this should improve performance).
> Each drive is set to the deadline scheduler.
>
> Drives are:
> Intel 520s MLC 480G SATA3
> Supposedly Read 550M/Write 520M
>
> I think the workload being generated is simply too much for the
> underlying drives. I've been collecting the information from
> /sys/block/<drive>/stat every 10 seconds for each drive. What makes me
> think the drives are overworked is that the backlog value gets very high
> at the same time the users complain about performance.
>
> The load is a bunch of windows VM's, which were working fine until
> recently when I migrated the main fileserver/domain controller on
> (previously it was a single SCSI Ultra320 disk on a standalone machine).
> Hence, this also seems to indicate a lack of performance.
>
> Currently the SSD's are connected to the onboard SATA ports (only SATA II):
> 00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI
> Controller (rev 05)

Why would you plug thousands of dollars of SSD into an onboard
controller? It's probably running off a 1x PCIe link shared with every
other onboard device. An LSI 8x 8 port HBA will run you a few
hundred (less than 1 SSD) and let you melt your northbridge. At least
on my Supermicro X8DTL boards I had to add active cooling to it or it
would overheat and crash at sustained IO. I can hit 2 - 2.5GB a second
doing large sequential IO with Samsung 840 Pros on a RAID10.

>
> There is one additional SSD which is just the OS drive also connected,
> but it is mostly idle (all it does is log the stats/etc).
>
> Assuming the issue is underlying hardware, then I'm thinking to do the
> following:
> 1) Get a battery backed RAID controller card (which should improve
> latency because the OS can pretend it is written while the card deals
> with writing it to disk).

As another person mentioned, hardware RAID is terrible for SSDs. SSDs
are optimized to use the large RAM cache they have onboard. Most
hardware RAID controllers will disable it.

> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
> protection (can lose up to four drives) and hopefully better performance
> (main concern right now), and same capacity as current.

I've had strange issues with anything other than RAID1 or 10 with SSD.
Even with the high IO and IOP rates of SSDs the parity calcs and extra
writes still seem to penalize you greatly.

Also, if your kernel does not have md TRIM support you risk taking a
SEVERE performance hit on writes. Once you complete a full write pass
on your NAND the SSD controller will require extra time to complete a
write. If your IO is mostly small and random this can cause your NAND
to become fragmented. If the fragmentation becomes bad enough you'll
be lucky to get one spinning disk's worth of write IO out of all 5
combined.
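
A quick sanity check, assuming hdparm is installed, to confirm a drive
even advertises TRIM (sdb is just an example):

hdparm -I /dev/sdb | grep -i trim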

-- 
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com


* Re: RAID performance
  2013-02-07  8:11 ` Stan Hoeppner
@ 2013-02-07 10:05   ` Adam Goryachev
  2013-02-16  4:33     ` RAID performance - *Slow SSDs likely solved* Stan Hoeppner
  2013-02-08  7:21   ` RAID performance Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 10:05 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On 07/02/13 19:11, Stan Hoeppner wrote:
> On 2/7/2013 12:48 AM, Adam Goryachev wrote:
> 
>> I'm trying to resolve a significant performance issue (not arbitrary dd
>> tests, etc but real users complaining, real workload performance).
> 
> It's difficult to analyze your situation without even a basic
> description of the workload(s).  What is the file access pattern?  What
> types of files?
> 
>> I'm currently using 5 x 480GB SSD's in a RAID5 as follows:
>> md1 : active raid5 sdf1[0] sdc1[4] sdb1[5] sdd1[3] sde1[1]
>>       1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
> ...
>> Each drive is set to the deadline scheduler.
> 
> Switching to noop may help a little, as may disablig NCQ, i.e. putting
> the driver in native IDE mode, or setting queue depth to 1.

I will make these changes, will see how it goes.
Would that be:
echo 1 > /sys/block/sdb/queue/nr_requests

>> Drives are:
>> Intel 520s MLC 480G SATA3
>> Supposedly Read 550M/Write 520M
> 
>> I think the workload being generated is simply too much for the
>> underlying drives. 
> 
> Not possible.  With an effective spindle width of 4, these SSDs can do
> ~80K random read/write IOPS sustained.  To put that into perspective,
> you would need a ~$150,000 high end FC SAN array controller with 270 15K
> SAS drives in RAID0 to get the same IOPS.
> 
> The problem is not the SSDs.  Probably not the controller either.

Well, that is a relief :) I did hope that spending all that money on the
SSD's was going to provide enough performance.... So, it's probably a
stupid user error... let me provide more info below.

>> I've been collecting the information from
>> /sys/block/<drive>/stat every 10 seconds for each drive. What makes me
>> think the drives are overworked is that the backlog value gets very high
>> at the same time the users complain about performance.
> 
> What is "very high"?  Since you mention "backlog" I'll assume you're
> referring to field #11.  If so, note that on my idle server (field #9 is
> 0), it is currently showing 434045280 for field #11.  That's apparently
> a weighted value of milliseconds.  And apparently it's not reliable as a
> diagnostic value.

Sorry, I was referring to the change in value of field 11 (backlog) over
10 second intervals. It sounded to me like this would be an indication
of how far behind the drive was over that time period.

> What you should be looking at is field #9, which simply tells you how
> may IOs are in progress.  But even if this number is high, which it can
> be be very high with SSDs, it doesn't inform you if the drive is
> performing properly or not.  What you should be using is ioptop or
> something similar.  But this still isn't going to be all that informative.

OK, I've uploaded updated graphs showing:
device-8hr.png readsectors (field 3) and writesectors (field 7)
device_backlog-8hour (backlog field 11/activetime field 10)
device_queue-8hour (queue field 9)

I assume ioptop will simply collect values from either /sys or /proc or
similar, and present it in a nice report style format? Did you mean
iotop ? I've downloaded and installed that, but it seems to want to show
IO per process/thread, can you provide any hints on what values I should
be looking at? Maybe the raid5 thread? Is that:
  440 ?        S    183:07  \_ [md1_raid5]


>> The load is a bunch of windows VM's, which were working fine until
>> recently when I migrated the main fileserver/domain controller on
>> (previously it was a single SCSI Ultra320 disk on a standalone machine).
>> Hence, this also seems to indicate a lack of performance.
> 
> You just typed 4 lines and told us nothing of how this relates to the
> problem you wish us to help you solve.  Please be detailed.

I guess I'm not too sure how much information is too much.

OK, the entire system is as follows:
5 x 480G SSD
RAID5
DRBD (with the secondary disconnected during the day because this
improves performance to roughly the same as not using DRBD at all,
though eventually I'd like it online all the time...)
LVM to divide into multiple LV's
Each LV is then exported via iSCSI
There is 4 x 1G ethernet (using dual 1G Intel cards)
There are 6 physical machines using Xen 4.1 (debian testing)
Each physical box runs one or two windows VM's and has a single 1G ethernet

3 x Windows 2003 Terminal Servers (with 5 to 20 users each) running misc
apps such as Internet Explorer, MS Outlook, Word, Excel, etc

1 x MS SQL server MS Windows 2008 (basically unused, still in testing)

1 x Domain Controller MS Windows 2000 (this is the newly migrated one;
the problems started when it came across), and it also has about 230G
of data, which is where most users' PST files (Outlook data files)
live, as well as Word documents/etc.

1 x test workstation (MS Windows XP Pro, mostly unused, just for testing)

1 x MS Windows 2003 Server - Test Terminal Server, AVG (antivirus) admin
server, WSUS server, etc (this is hardly used).

1 x MS Windows 2003 Server - Application server, custom firebird
database application.

From the activity graphs I can see which LV's have lots of read/writes,
and the domain controller is the majority of the load, though if one of
the other machines happens to do lots of read or writes, then it will
impact all of the machines.

>> 1) Get a battery backed RAID controller card (which should improve
>> latency because the OS can pretend it is written while the card deals
>> with writing it to disk).
> 
> [BB/FB]WC is basically useless with SSDs.  LSI has the best boards, and
> the "FastPath" option for SSDs basically disables the onboard cache to
> get it out of the way.  Enterprise SSDs have extra capacitance allowing
> for cache flushing on power loss so battery/flash protection on the RAID
> card isn't necessary.  The write cache on the SSDs themselves is faster
> in aggregate than the RAID card's ASIC and cache RAM interface, thus
> having BBWC on the card enabled with SSDs actually slows you down.
> 
> So, in short, this isn't the answer to your problem, either.

So would it be safe to tell DRBD that I have a battery backed
controller? The system is protected by UPS of course, so power issues
*should* be minimal.
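
i.e. in drbd.conf something like the following, if I'm reading the 8.3
docs correctly (I haven't applied this yet):

disk {
    no-disk-barrier;
    no-disk-flushes;
    no-md-flushes;
}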

>> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
>> protection (can lose up to four drives) and hopefully better performance
>> (main concern right now), and same capacity as current.
> 
> You've got plenty of hardware performance.  Moving to RAID10 will simply
> cost more money with no performance gain.  Here's why:
> 
> md/RAIAD5 and md/RAID10 both rely on a single write thread.  If you've
> been paying attention on this list you know that patches are in the
> works to fix this but are not, AFAIK, in mainline yet, and a long way
> from being in distro kernels.  So, you've got maximum possible read
> performance now, but your *write performance is limited to a single CPU
> core* with both of these RAID drives.  If your problem is write
> performance, your only solution at this time with md is to use a layered
> RAID, such as RAID0 over RAID1 pairs, or linear over RAID1 pairs.  This
> puts all of your cores in play for writes.
> 
> The reason this is an issue is that even a small number of SSDs can
> overwhelm a single md thread, which is limited to one core of
> throughput.  This has also been discussed thoroughly here recently.
> 
>> The real questions are:
>> 1) Is this data enough to say that the performance issue is due to
>> underlying hardware as opposed to a mis-configuration?
> 
> No, it's not.  We really need to have more specific workload data.

Will the graphs showing read/write values assist with this? If not, what
data can I provide which would help? I'm collecting every value in the
/sys/block/<device>/stat file already, on 10sec intervals...

I have another system which collects and graphs the overall CPU load;
during the day this sits around 0.5 - could this be related? I know
loadavg is not so useful, so what would be the best tool/number to
watch to determine this?

>> 2) If so, any suggestions on specific hardware which would help?
> 
> It's not a hardware problem.  Given that it's a VM consolidation host,
> I'd guess it's a hypervisor configuration problem.

Possibly, but since the issue appears to be disk IO performance, and
since the various stats I've been collecting seemed to indicate the
backlog was happening at the disk level, I thought this pointed to the
problem. It is pretty challenging to diagnose all of the layers at the
same time! So I started at the bottom...

>> It is possibly to wipe the array and re-create that would help.......
> 
> Unless you're write IOPS starved due to md/RAID5 as I described above,
> blowing away the array and creating a new one isn't going to help.  You
> simply need to investigate further.

So how do I know if I'm write IOPS starved? Is there a number somewhere
which can be used to identify this?

> And if you would like continued assistance, you'd need to provide much
> greater detail of the hardware and workload.  You didn't mention your
> CPU(s) model/freq.  This matters greatly with RAID5 and SSD.  Nor RAM
> type/capacity, network topology, nor number of users and what
> applications they're running when they report the performance problem.
> Nor did you mention which hypervisor kernel/distro you're using, how
> many Windows VMs you're running, and the primary workload of each, etc,
> etc, etc.

Storage server:
Intel S1200BTLR Serverboard - LGA1155 / 4xDDR3 / 6xSATAII / Raid / 2xGbE
Intel Xeon E3-1230V2/3.3GHz/8MB CACHE/4CORE/8THREAD/5GTs/LGA1155 CPU
Kingston 4GB 1333MHz DDR3 ECC CL9 240PIN DIMM with Thermal Sensor
(Intel) (times 2, total 8G RAM)
Intel Dual Port Server PCIe v2.0 Adapter, I350-T2, Rj-45 Copper,
Low Profile & Full Height I350T2 (times 2, total 6 x 1G ethernet)
5 x Intel 520s MLC 480G SATA3
Debian Stable with DRBD 8.3.15, kernel 2.6.32-5-amd64

Physical Servers:
AMD PHENOM II X6 1100T 6CORE/3.3GHz/9MB CACHE/125W/Black Edition
Asus M5A88-M AMD Socket AM3+ Motherboard
Kingston 4GB 1333MHz DDR3 NON-ECC CL9 240pin KVR1333D3N9/4G (total 32G)
Intel 40GB X25M 320 Series SSD Drive
Debian Testing, kernel 3.2.0-4-amd64 and Xen 4.1

All servers connected to:
Netgear GS716T - 16 Port Gigabit Ethernet Switch


The problems are somewhat intermittent; sometimes the users don't even
report them. Basically, on occasion, when a user copies a large file
from disk to disk, or when a user is using Outlook (frequently data
files are over 2G), or just under general workload, the system will
"stall", sometimes causing user-level errors, which mostly affect
Outlook.

>> Sorry, graphs can be seen at:
>> http://203.98.89.64/graphs/
> The graphs tell us nothing in isolation Adam.  What is needed is to know
> what workloads are running when the device queues and response times
> start piling up.  Whether rust or SSD, queues will increase, as well as
> IO completion times, when sufficient IO is being performed.

I've updated the graphs and added additional graphs as above. I'm still
not sure exactly what data I can provide. The main thing I do know is
that at some random time (a number of times throughout a normal working
day) a number of users will report issues of freezing, and sometimes
this goes on for an extended period of time (10 minutes). Usually I can
see a large amount of read/write activity which corresponds to the same
time (though I don't know specifically what application or user caused
the read/write activity). I can only imagine that copying a 50G file
from directory A to directory B (on the same disk), on top of all the
other random normal IO, would generate an equivalent load.

So, is there any further information I should provide? What hard numbers
can I look at to guide me?

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-07  7:02 ` Carsten Aulbert
@ 2013-02-07 10:12   ` Adam Goryachev
  2013-02-07 10:29     ` Carsten Aulbert
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 10:12 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-raid

On 07/02/13 18:02, Carsten Aulbert wrote:
> Hi Adam
> 
> On 02/07/2013 07:48 AM, Adam Goryachev wrote:
>>
>> Each drive only has a single partition, and is partitioned a little
>> smaller than the drive (supposedly this should improve performance).
>> Each drive is set to the deadline scheduler.
> 
> First, I'd start with the deadline scheduler.

Already using it, am going to try noop shortly....

> Even for rotating rust systems you can improve performance quite a bit,
> but with SSDs you might hit the limits way too early and need to tune it.
> 
> Save the current default values somewhere safe, so you can go back
> easily, then start tuning these values (as a start):
> 
> for i in $(grep \^md /proc/mdstat |cut -d' ' -f 5-); do
>     DEV=${i:0:3}
>     echo deadline > /sys/block/$DEV/queue/scheduler
>     echo 4096  > /sys/block/$DEV/queue/nr_requests
>     echo 8192  > /sys/block/$DEV/queue/read_ahead_kb
>     echo 5000  > /sys/block/$DEV/queue/iosched/read_expire
>     echo 1000 > /sys/block/$DEV/queue/iosched/write_expire
>     echo 2048  > /sys/block/$DEV/queue/iosched/fifo_batch
> done
> 
> At least setting these helped quite a bit.

Do you have any information on what your workload is, or how/why these
values might help?

You are changing values significantly from the default, and I am
cautious that they may cause other issues. Also, someone else has
advised to reduce nr_requests rather than increasing it?

Thank you for your advice.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-07  9:07 ` Dave Cundiff
@ 2013-02-07 10:19   ` Adam Goryachev
  2013-02-07 11:07     ` Dave Cundiff
  2013-02-07 12:01     ` Brad Campbell
  2013-02-11 19:49   ` Roy Sigurd Karlsbakk
  1 sibling, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 10:19 UTC (permalink / raw)
  To: Dave Cundiff; +Cc: linux-raid

On 07/02/13 20:07, Dave Cundiff wrote:
> On Thu, Feb 7, 2013 at 1:48 AM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
> Why would you plug thousands of dollars of SSD into an onboard
> controller? It's probably running off a 1x PCIE shared with every
> other onboard device. An LSI 8x 8 port HBA will run you a few
> hundred(less than 1 SSD) and let you melt your northbridge. At least
> on my Supermicro X8DTL boards I had to add active cooling to it or it
> would overheat and crash at sustained IO. I can hit 2 - 2.5GB a second
> doing large sequential IO with Samsung 840 Pros on a RAID10.

Because originally I was just using 4 x 2TB 7200 rpm disks in RAID10, I
upgraded to SSD to improve performance (which it did), but hadn't (yet)
upgraded the SATA controller because I didn't know if it would help.

I'm seeing conflicting information here (buy SATA card or not)...

>> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
>> protection (can lose up to four drives) and hopefully better performance
>> (main concern right now), and same capacity as current.
> 
> I've had strange issues with anything other than RAID1 or 10 with SSD.
> Even with the high IO and IOP rates of SSDs the parity calcs and extra
> writes still seem to penalize you greatly.

Maybe this is the single threaded nature of RAID5 (and RAID10) ?

> Also if your kernel does not have md TRIM support you risk taking a
> SEVERE performance hit on writes. Once you complete a full write pass
> on your NAND the SSD controller will require extra time to complete a
> write. if your IO is mostly small and random this can cause your NAND
> to become fragmented. If the fragmentation becomes bad enough you'll
> be lucky to get 1 spinning disk worth of write IO out of all 5
> combined.

This was the reason I made the partition (for raid) smaller than the
disk, and left the rest un-partitioned. However, as you said, once I've
fully written enough data to fill the raw disk capacity, I still have a
problem. Is there some way to instruct the disk (overnight) to TRIM the
extra blank space, and do whatever it needs to tidy things up? Perhaps
this would help, at least first thing in the morning if it isn't enough
to get through the day. Potentially I could add a 6th SSD, reduce the
partition size across all of them, just so there is more blank space to
get through a full day worth of writes?

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-07 10:12   ` Adam Goryachev
@ 2013-02-07 10:29     ` Carsten Aulbert
  2013-02-07 10:41       ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Carsten Aulbert @ 2013-02-07 10:29 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

Hi

On 02/07/2013 11:12 AM, Adam Goryachev wrote:
> Do you have any information on what your workload is, or how/why these
> values might help?

Most of our data servers are exporting data read-only via NFS. It helped
well on those, but even more on the download servers, which allowed us to
go from 200k 4MB downloads per day to 1M/day while adding new files
constantly.

> 
> You are changing values significantly from the default, and I am
> cautious that they may cause other issues. Also, someone else has
> advised to reduce nr_requests rather than increasing it?
> 

I know, but these larger queues really helped a lot in re-ordering
requests to better match the hardware underneath - but again, this was
for hard drives with physical arms and not SSDs

Cheers

Carsten

-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
phone/fax: +49 511 762-17185 / -17193
https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome


* Re: RAID performance
  2013-02-07 10:29     ` Carsten Aulbert
@ 2013-02-07 10:41       ` Adam Goryachev
  0 siblings, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 10:41 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-raid

On 07/02/13 21:29, Carsten Aulbert wrote:
> Hi
> 
> On 02/07/2013 11:12 AM, Adam Goryachev wrote:
>> Do you have any information on what your workload is, or how/why these
>> values might help?
> 
> Most of our data servers are exporting data read-only via NFS. It helped
> well on these but more on download servers which allowed us to go from
> 200k 4MB downloads per day to 1M/day while adding new files constantly.
> 
>>
>> You are changing values significantly from the default, and I am
>> cautious that they may cause other issues. Also, someone else has
>> advised to reduce nr_requests rather than increasing it?
>>
> 
> I know, but these larger queues really helped a lot in re-ordering
> requests to better match the hardware underneath - but again, this was
> for hard drives with physical arms and not SSDs

OK, this makes a lot more sense then :) Definitely slowing things
down/making the opportunity to re-order requests should help a lot for
spinning disks... I don't see that helping much at all with SSD's though.

Thanks,
Adam


-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-07 10:19   ` Adam Goryachev
@ 2013-02-07 11:07     ` Dave Cundiff
  2013-02-07 12:49       ` Adam Goryachev
  2013-02-08  3:32       ` RAID performance Stan Hoeppner
  2013-02-07 12:01     ` Brad Campbell
  1 sibling, 2 replies; 131+ messages in thread
From: Dave Cundiff @ 2013-02-07 11:07 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On Thu, Feb 7, 2013 at 5:19 AM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> On 07/02/13 20:07, Dave Cundiff wrote:
>> On Thu, Feb 7, 2013 at 1:48 AM, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>> Why would you plug thousands of dollars of SSD into an onboard
>> controller? It's probably running off a 1x PCIE shared with every
>> other onboard device. An LSI 8x 8 port HBA will run you a few
>> hundred(less than 1 SSD) and let you melt your northbridge. At least
>> on my Supermicro X8DTL boards I had to add active cooling to it or it
>> would overheat and crash at sustained IO. I can hit 2 - 2.5GB a second
>> doing large sequential IO with Samsung 840 Pros on a RAID10.
>
> Because originally I was just using 4 x 2TB 7200 rpm disks in RAID10, I
> upgraded to SSD to improve performance (which it did), but hadn't (yet)
> upgraded the SATA controller because I didn't know if it would help.
>
> I'm seeing conflicting information here (buy SATA card or not)...

It's not going to help your remote access any. From your configuration
it looks like you are limited to 4 gigabits. At least as long as your
NICs are not in the slot shared with the disks. If they are you might
get some contention.

http://download.intel.com/support/motherboards/server/sb/g13326004_s1200bt_tps_r2_0.pdf

See page 17 for a block diagram of your motherboard. You have a 4x DMI
connection that PCI slot 3, your disks, and every other onboard device
share. That should be about 1.2GB/s (10 gigabits) of bandwidth. Your
SSDs alone could saturate that if you performed a local operation. Get
your NICs going at 4Gig and all of a sudden you'll really want that
SATA card in slot 4 or 5.
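
Rough math: 5 SSDs at ~500MB/s each is ~2.5GB/s of potential disk
bandwidth against ~1.2GB/s on the shared DMI link, while 4 x 1GbE is
only ~0.5GB/s - fine on its own, but not once local disk IO competes
over the same link.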

>
>>> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
>>> protection (can lose up to four drives) and hopefully better performance
>>> (main concern right now), and same capacity as current.
>>
>> I've had strange issues with anything other than RAID1 or 10 with SSD.
>> Even with the high IO and IOP rates of SSDs the parity calcs and extra
>> writes still seem to penalize you greatly.
>
> Maybe this is the single threaded nature of RAID5 (and RAID10) ?

I definitely see that. See below for a FIO run I just did on one of my RAID10s

md2 : active raid10 sdb3[1] sdf3[5] sde3[4] sdc3[2] sdd3[3] sda3[0]
      742343232 blocks super 1.2 32K chunks 2 near-copies [6/6] [UUUUUU]

seq-read: (g=0): rw=read, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio,
iodepth=32
seq-write: (g=2): rw=write, bs=64K-64K/64K-64K/64K-64K,
ioengine=libaio, iodepth=32

Run status group 0 (all jobs):
   READ: io=4096.0MB, aggrb=2149.3MB/s, minb=2149.3MB/s,
maxb=2149.3MB/s, mint=1906msec, maxt=1906msec

Run status group 2 (all jobs):
  WRITE: io=4096.0MB, aggrb=1168.7MB/s, minb=1168.7MB/s,
maxb=1168.7MB/s, mint=3505msec, maxt=3505msec

These drives are pretty fresh and my writes are a whole gigabyte per
second less than my reads. It's not for lack of bandwidth either.

>
>> Also if your kernel does not have md TRIM support you risk taking a
>> SEVERE performance hit on writes. Once you complete a full write pass
>> on your NAND the SSD controller will require extra time to complete a
>> write. if your IO is mostly small and random this can cause your NAND
>> to become fragmented. If the fragmentation becomes bad enough you'll
>> be lucky to get 1 spinning disk worth of write IO out of all 5
>> combined.
>
> This was the reason I made the partition (for raid) smaller than the
> disk, and left the rest un-partitioned. However, as you said, once I've
> fully written enough data to fill the raw disk capacity, I still have a
> problem. Is there some way to instruct the disk (overnight) to TRIM the
> extra blank space, and do whatever it needs to tidy things up? Perhaps
> this would help, at least first thing in the morning if it isn't enough
> to get through the day. Potentially I could add a 6th SSD, reduce the
> partition size across all of them, just so there is more blank space to
> get through a full day worth of writes?

There was a script called mdtrim that would use hdparm to manually
send the proper TRIM commands to the drives. I didn't bother looking
for a link because it scares me to death and you probably shouldn't
use it. If it gets the math wrong random data will disappear from your
disks.

As for changing partition sizes, you really have to know what kinds of
IO you're doing. If all you're doing is hammering these things with
tiny IOs 24x7 they're going to end up with terrible write IO. At least
my SSDs do. If you have a decent mix of small and large it may not
fragment as badly. I ran random 4k against mine for 2 days before it
got miserably slow. Reading will always be fine.
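
For reference, that kind of sustained random-4k load is easy to
reproduce with a fio job along these lines (illustrative only, not my
exact command; the target file path and size are placeholders):

fio --name=randwrite --filename=/path/to/testfile --size=10G \
    --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --time_based --runtime=172800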


--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com


* Re: RAID performance
  2013-02-07  6:48 RAID performance Adam Goryachev
                   ` (3 preceding siblings ...)
  2013-02-07  9:07 ` Dave Cundiff
@ 2013-02-07 11:32 ` Mikael Abrahamsson
  4 siblings, 0 replies; 131+ messages in thread
From: Mikael Abrahamsson @ 2013-02-07 11:32 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On Thu, 7 Feb 2013, Adam Goryachev wrote:

> Any comments, suggestions, advice greatly received.

I'm interested in the output from "iostat -x 5" from when you see the 
problem.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: RAID performance
  2013-02-07 10:19   ` Adam Goryachev
  2013-02-07 11:07     ` Dave Cundiff
@ 2013-02-07 12:01     ` Brad Campbell
  2013-02-07 12:37       ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Brad Campbell @ 2013-02-07 12:01 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On 07/02/13 18:19, Adam Goryachev wrote:

> problem. Is there some way to instruct the disk (overnight) to TRIM the
> extra blank space, and do whatever it needs to tidy things up? Perhaps
> this would help, at least first thing in the morning if it isn't enough
> to get through the day. Potentially I could add a 6th SSD, reduce the
> partition size across all of them, just so there is more blank space to
> get through a full day worth of writes?

I have 6 SSD's in a RAID10, and with 3.7.x (I forget which x - 2 or 3 
from memory) md will pass the TRIM down to the underlying devices (at 
least for RAID10 and from memory 1).

I have a cronjob that runs at midnight :

#!/bin/sh
export TIME="%E real\n%U user\n%S sys\n"
for i in / /home /raid10 ; do
	/usr/bin/time /home/brad/bin/fstrim -v $i
done

Based on the run times, and the bytes trimmed count I suspect it works.
All filesystems are ext4. Two of them are passed through encryption, but 
that passes TRIM down also. I do not have the discard option on any 
mounts (that way lies severe performance issues).

Regards,
Brad


* Re: RAID performance
  2013-02-07 12:01     ` Brad Campbell
@ 2013-02-07 12:37       ` Adam Goryachev
  2013-02-07 17:12         ` Fredrik Lindgren
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 12:37 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2780 bytes --]

On 07/02/13 23:01, Brad Campbell wrote:
> On 07/02/13 18:19, Adam Goryachev wrote:
> 
>> problem. Is there some way to instruct the disk (overnight) to TRIM the
>> extra blank space, and do whatever it needs to tidy things up? Perhaps
>> this would help, at least first thing in the morning if it isn't enough
>> to get through the day. Potentially I could add a 6th SSD, reduce the
>> partition size across all of them, just so there is more blank space to
>> get through a full day worth of writes?
> 
> I have 6 SSD's in a RAID10, and with 3.7.x (I forget which x - 2 or 3
> from memory) md will pass the TRIM down to the underlying devices (at
> least for RAID10 and from memory 1).

Yes, I have read that the very new kernel has those patches, but I'm on
2.6.x at the moment, and in addition, see below why they wouldn't help
anyway...

> I have a cronjob that runs at midnight :
> Based on the run times, and the bytes trimmed count I suspect it works.
> All filesystems are ext4. Two of them are passed through encryption, but
> that passes TRIM down also. I do not have the discard option on any
> mounts (that way lies severe performance issues).

I don't have any FS on this RAID, it is like this:
5 x SSD
RAID5 (doesn't support TRIM, though I've seen some patches but I think
they are not included in any kernel yet).
DRBD (doubt this supports TRIM)
LVM (don't think it supports TRIM, maybe in a newer kernel)
iSCSI (don't think it supports TRIM)
Windows 2003 and Windows 2000 (don't think they support TRIM)

So, really, all I want to do is use TRIM on the portion of the drive
which is not partitioned at all, and I suspect the SSD knows that
section is available, but how do I tell the drive "please go and do a
cleanup now, because the users are all sleeping"?

BTW, I just created a small LV (15G) and ran a couple of write tests
(well, not proper one, but at least you get some idea how bad things are).
dd if=/dev/zero of=/dev/vg0/testlv oflag=direct bs=16k count=50k
^C50695+0 records in
50695+0 records out
830586880 bytes (831 MB) copied, 99.4635 s, 8.4 MB/s

I killed it after waiting a while....  This is while most of the systems
are idle, except one which is currently being backed up (lots of reads,
a small number of writes). This seems indicative of IO starvation; I
would have expected significantly higher write performance.

While I was running the dd, I ran a iostat -x 5 in another session:

See text file attached for output, as it seems to want to line wrap
because it is too wide....

dm-13 is the client (windows 2003) which is currently being backed up,
dm-14 is the testlv I'm writing to from the localhost.

Suggestions on better testing methods or is this expected?

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

[-- Attachment #2: stats.txt --]
[-- Type: text/plain, Size: 3499 bytes --]

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.20     0.00     1.60     8.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00  500.80  881.00  4648.20  7029.00     8.45     0.12    0.09   0.03   3.76
sdc               0.00     0.00  509.60  895.20  4676.20  7142.60     8.41     0.11    0.08   0.03   4.08
sdd               0.00     0.00  520.00  900.60  4753.80  7185.80     8.40     0.11    0.07   0.03   4.00
sde               0.00     0.00  528.80  922.80  4831.20  7363.40     8.40     0.16    0.11   0.04   5.20
sdf               0.60     0.00  515.00  906.60  4744.00  7233.80     8.43     0.14    0.10   0.04   5.04
md1               0.00     0.00  200.60  601.40  4659.80 17917.60    28.15     0.00    0.00   0.00   0.00
drbd2             0.00     0.00  200.60  599.80  4659.80 17916.00    28.21     1.15    1.45   1.25  99.84
dm-1              0.00     0.00   29.00    4.60   464.00    67.40    15.82     0.01    0.26   0.21   0.72
dm-5              0.00     0.00    0.20    2.80     1.60    13.40     5.00     0.01    2.40   1.60   0.48
dm-13             0.00     0.00  170.20   67.40  4184.60  1129.60    22.37     0.14    0.58   0.38   8.96
dm-21             0.00     0.00    0.00    0.60     0.00     8.00    13.33     0.00    0.00   0.00   0.00
dm-29             0.00     0.00    0.20    0.80     1.60     9.60    11.20     0.00    0.80   0.80   0.08
dm-37             0.00     0.00    0.20    0.00     1.60     0.00     8.00     0.00    0.00   0.00   0.00
dm-45             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-53             0.00     0.00    0.00    1.20     0.00     9.60     8.00     0.00    1.33   0.67   0.08
dm-61             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-65             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-71             0.00     0.00    0.20    0.80     1.60    11.20    12.80     0.00    1.60   1.60   0.16
dm-9              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-15             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.40    1.20     3.20     9.60     8.00     0.00    0.00   0.00   0.00
dm-11             0.00     0.00    0.40    2.80     3.20    13.40     5.19     0.01    2.00   1.25   0.40
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.60     0.00     4.80     8.00     0.00    2.67   1.33   0.08
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-8              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-10             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-12             0.00     0.00    0.00    0.40     0.00     3.20     8.00     0.00    2.00   2.00   0.08
dm-14             0.00     0.00    0.00  520.60     0.00 16659.20    32.00     1.00    1.92   1.92 100.00



* Re: RAID performance
  2013-02-07 11:07     ` Dave Cundiff
@ 2013-02-07 12:49       ` Adam Goryachev
  2013-02-07 12:53         ` Phil Turmel
  2013-02-07 15:32         ` Dave Cundiff
  2013-02-08  3:32       ` RAID performance Stan Hoeppner
  1 sibling, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 12:49 UTC (permalink / raw)
  To: Dave Cundiff; +Cc: linux-raid

On 07/02/13 22:07, Dave Cundiff wrote:
> On Thu, Feb 7, 2013 at 5:19 AM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>> On 07/02/13 20:07, Dave Cundiff wrote:
>>> On Thu, Feb 7, 2013 at 1:48 AM, Adam Goryachev
>>> <mailinglists@websitemanagers.com.au> wrote:
>>> Why would you plug thousands of dollars of SSD into an onboard
>>> controller? It's probably running off a 1x PCIE shared with every
>>> other onboard device. An LSI 8x 8 port HBA will run you a few
>>> hundred(less than 1 SSD) and let you melt your northbridge. At least
>>> on my Supermicro X8DTL boards I had to add active cooling to it or it
>>> would overheat and crash at sustained IO. I can hit 2 - 2.5GB a second
>>> doing large sequential IO with Samsung 840 Pros on a RAID10.
>>
>> Because originally I was just using 4 x 2TB 7200 rpm disks in RAID10, I
>> upgraded to SSD to improve performance (which it did), but hadn't (yet)
>> upgraded the SATA controller because I didn't know if it would help.
>>
>> I'm seeing conflicting information here (buy SATA card or not)...
> 
> Its not going to help your remote access any. From your configuration
> it looks like you are limited to 4 gigabits. At least as long as your
> NICs are not in the slot shared with the disks. If they are you might
> get some contention.
> 
> http://download.intel.com/support/motherboards/server/sb/g13326004_s1200bt_tps_r2_0.pdf
> 
> See page 17 for a block diagram of your motherboard. You have a 4x DMI
> connection that PCI slot 3, your disks, and every other onboard device
> share. That should be about 1.2GB(10Gigabits) of bandwidth. Your SSDs
> alone could saturate that if you performed a local operation. Get your
> NIC's going at 4Gig and all of it a sudden you'll really want that
> SATA card in slot 4 or 5.

OK, I'll have to check that the 4 x 1G ethernet are in slots 4 and 5
now, not using the onboard ethernet, and not in slot 3...

If I could get close to 4Gbps (ie, saturate the ethernet) then I think
I'd be more than happy... I don't see my SSD's running at 400MB/s though
anyway....

>>>> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
>>>> protection (can lose up to four drives) and hopefully better performance
>>>> (main concern right now), and same capacity as current.
>>>
>>> I've had strange issues with anything other than RAID1 or 10 with SSD.
>>> Even with the high IO and IOP rates of SSDs the parity calcs and extra
>>> writes still seem to penalize you greatly.
>>
>> Maybe this is the single threaded nature of RAID5 (and RAID10) ?
> 
> I definitely see that. See below for a FIO run I just did on one of my RAID10s
> 
> md2 : active raid10 sdb3[1] sdf3[5] sde3[4] sdc3[2] sdd3[3] sda3[0]
>       742343232 blocks super 1.2 32K chunks 2 near-copies [6/6] [UUUUUU]
> 
> seq-read: (g=0): rw=read, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio,
> iodepth=32
> seq-write: (g=2): rw=write, bs=64K-64K/64K-64K/64K-64K,
> ioengine=libaio, iodepth=32
> 
> Run status group 0 (all jobs):
>    READ: io=4096.0MB, aggrb=2149.3MB/s, minb=2149.3MB/s,
> maxb=2149.3MB/s, mint=1906msec, maxt=1906msec
> 
> Run status group 2 (all jobs):
>   WRITE: io=4096.0MB, aggrb=1168.7MB/s, minb=1168.7MB/s,
> maxb=1168.7MB/s, mint=3505msec, maxt=3505msec
> 
> These drives are pretty fresh and my writes are a whole gig less than
> my read. Its not for lack of bandwidth either.

Can you please show the command line you used, so I can run a similar
test and compare?

>>> Also if your kernel does not have md TRIM support you risk taking a
>>> SEVERE performance hit on writes. Once you complete a full write pass
>>> on your NAND the SSD controller will require extra time to complete a
>>> write. if your IO is mostly small and random this can cause your NAND
>>> to become fragmented. If the fragmentation becomes bad enough you'll
>>> be lucky to get 1 spinning disk worth of write IO out of all 5
>>> combined.
>>
>> This was the reason I made the partition (for raid) smaller than the
>> disk, and left the rest un-partitioned. However, as you said, once I've
>> fully written enough data to fill the raw disk capacity, I still have a
>> problem. Is there some way to instruct the disk (overnight) to TRIM the
>> extra blank space, and do whatever it needs to tidy things up? Perhaps
>> this would help, at least first thing in the morning if it isn't enough
>> to get through the day. Potentially I could add a 6th SSD, reduce the
>> partition size across all of them, just so there is more blank space to
>> get through a full day worth of writes?
> 
> There was a script called mdtrim that would use hdparm to manually
> send the proper TRIM commands to the drives. I didn't bother looking
> for a link because it scares me to death and you probably shouldn't
> use it. If it gets the math wrong random data will disappear from your
> disks.

That doesn't sound good... it would be nice to use smartctl or similar to
ask the drive to "please tidy up now". The drive itself knows that the
unpartitioned space is available.

> As for changing partition sizes you really have to know what kinds of
> IO you're doing. If all you're doing is hammering these things with
> tiny IOs 24x7 its gonna end up with terrible write IO. At least my
> SSDs do. If you have a decent mix of small and large it may not
> fragment as badly. I ran random 4k against mine for 2 days before it
> got miserably slow. Reading will always be fine.

Well, if I can re-trim daily, and have enough clean space to work for 2
days, then I should never hit this problem.... Assuming it loses *that
much* performance....

Thanks,
Adam


-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 12:49       ` Adam Goryachev
@ 2013-02-07 12:53         ` Phil Turmel
  2013-02-07 12:58           ` Adam Goryachev
  2013-02-07 15:32         ` Dave Cundiff
  1 sibling, 1 reply; 131+ messages in thread
From: Phil Turmel @ 2013-02-07 12:53 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 02/07/2013 07:49 AM, Adam Goryachev wrote:
> Well, if I can re-trim daily, and have enough clean space to work for 2
> days, then I should never hit this problem.... Assuming it loses *that
> much* performance....

You keep saying this, but you are only going to be trimming the
unpartitioned space.  That won't help you after the first trim.

Phil


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 12:53         ` Phil Turmel
@ 2013-02-07 12:58           ` Adam Goryachev
  2013-02-07 13:03             ` Phil Turmel
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 12:58 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Dave Cundiff, linux-raid

On 07/02/13 23:53, Phil Turmel wrote:
> On 02/07/2013 07:49 AM, Adam Goryachev wrote:
>> Well, if I can re-trim daily, and have enough clean space to work for 2
>> days, then I should never hit this problem.... Assuming it loses *that
>> much* performance....
> 
> You keep saying this, but you are only going to be trimming the
> unpartitioned space.  That won't help you after the first trim.


Right, so if I TRIM it each night, then the next day, there is a bunch
of freshly TRIM'ed space to use up. As long as there is enough for the
day's writes, then I won't face this issue of slow writes while the SSD
is trying to do garbage collection or whatever...

Or, I'm just dreaming :)

Or, this isn't my problem, so I should just ignore it....

Thanks,
Adam


-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 12:58           ` Adam Goryachev
@ 2013-02-07 13:03             ` Phil Turmel
  2013-02-07 13:08               ` Adam Goryachev
  2013-02-07 22:03               ` Chris Murphy
  0 siblings, 2 replies; 131+ messages in thread
From: Phil Turmel @ 2013-02-07 13:03 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 02/07/2013 07:58 AM, Adam Goryachev wrote:
> On 07/02/13 23:53, Phil Turmel wrote:
>> On 02/07/2013 07:49 AM, Adam Goryachev wrote:
>>> Well, if I can re-trim daily, and have enough clean space to work for 2
>>> days, then I should never hit this problem.... Assuming it loses *that
>>> much* performance....
>>
>> You keep saying this, but you are only going to be trimming the
>> unpartitioned space.  That won't help you after the first trim.
> 
> 
> Right, so if I TRIM it each night, then the next day, there is a bunch
> of freshly TRIM'ed space to use up. As long as there is enough for the
> day's writes, then I won't face this issue of slow writes while the SSD
> is trying to do garbage collection or whatever...

Trim is a per-sector property.  Once trimmed, it stays trimmed until you
write to *that* sector.

> Or, I'm just dreaming :)

You are dreaming.  If you want to achieve something with trim, you would
trim on your clients and hope it passes all the way down the stack
through iSCSI to your LVs and then to MD and the SSDs.

> Or, this isn't my problem, so I should just ignore it....

You just have nothing useful on the server side for trim to do.
(Although you should manually trim the unpartitioned space.  You only
need to do so once.)
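
For example, something along these lines should do it, assuming a
reasonably recent util-linux that ships blkdiscard -- but the numbers
below are placeholders, so check the real partition table first, because
discarding the wrong range silently destroys data:

# Find the end sector of the last partition (example device /dev/sdb):
parted /dev/sdb unit s print
# Then discard from just past that point to the end of the device,
# e.g. if the last partition ends at sector 850000000 (512-byte sectors):
blkdiscard --offset $(( 850000001 * 512 )) /dev/sdb

If your blkdiscard insists on an explicit --length, give it the size of
the unpartitioned tail in bytes.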

Phil


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 13:03             ` Phil Turmel
@ 2013-02-07 13:08               ` Adam Goryachev
  2013-02-07 13:20                 ` Mikael Abrahamsson
  2013-02-07 22:03               ` Chris Murphy
  1 sibling, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-07 13:08 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Dave Cundiff, linux-raid

On 08/02/13 00:03, Phil Turmel wrote:
> On 02/07/2013 07:58 AM, Adam Goryachev wrote:
>> On 07/02/13 23:53, Phil Turmel wrote:
>>> On 02/07/2013 07:49 AM, Adam Goryachev wrote:
>>>> Well, if I can re-trim daily, and have enough clean space to work for 2
>>>> days, then I should never hit this problem.... Assuming it loses *that
>>>> much* performance....
> 
> You just have nothing useful on the server side for trim to do.
> (Although you should manually trim the unpartitioned space.  You only
> need to do so once.)

If the unpartitioned space has *never* been used/partitioned, then I
presume TRIM wouldn't help here.... So I guess I might as well leave
this one alone.

Though, like I said, would adding an extra SSD to the RAID5, and
reducing the size of all partitions by 20%, and then doing TRIM on that
newly freed space, would that improve performance because of the extra
free space the SSD can "work with" ?

Thanks again,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 13:08               ` Adam Goryachev
@ 2013-02-07 13:20                 ` Mikael Abrahamsson
  0 siblings, 0 replies; 131+ messages in thread
From: Mikael Abrahamsson @ 2013-02-07 13:20 UTC (permalink / raw)
  To: linux-raid

On Fri, 8 Feb 2013, Adam Goryachev wrote:

> Though, like I said, would adding an extra SSD to the RAID5, and 
> reducing the size of all partitions by 20%, and then doing TRIM on that 
> newly freed space, would that improve performance because of the extra 
> free space the SSD can "work with" ?

Yes, that probably doubles the over-provisioning on the drive, which
should be beneficial in a number of respects.

<http://en.wikipedia.org/wiki/Garbage_collection_(SSD)#Over-provisioning>
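
As a rough worked example (illustrative numbers only): with a 480GB
drive and the md partition shrunk to 384GB, the user-added
over-provisioning is

  (480 - 384) / 384 ~= 25%

on top of whatever spare area the manufacturer already reserves, so the
controller always has a pool of pre-erased blocks to absorb a day's
writes.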

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 12:49       ` Adam Goryachev
  2013-02-07 12:53         ` Phil Turmel
@ 2013-02-07 15:32         ` Dave Cundiff
  2013-02-08 13:58           ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Dave Cundiff @ 2013-02-07 15:32 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On Thu, Feb 7, 2013 at 7:49 AM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
>>
>> I definitely see that. See below for a FIO run I just did on one of my RAID10s
>>
>> md2 : active raid10 sdb3[1] sdf3[5] sde3[4] sdc3[2] sdd3[3] sda3[0]
>>       742343232 blocks super 1.2 32K chunks 2 near-copies [6/6] [UUUUUU]
>>
>> seq-read: (g=0): rw=read, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio,
>> iodepth=32
>> seq-write: (g=2): rw=write, bs=64K-64K/64K-64K/64K-64K,
>> ioengine=libaio, iodepth=32
>>
>> Run status group 0 (all jobs):
>>    READ: io=4096.0MB, aggrb=2149.3MB/s, minb=2149.3MB/s,
>> maxb=2149.3MB/s, mint=1906msec, maxt=1906msec
>>
>> Run status group 2 (all jobs):
>>   WRITE: io=4096.0MB, aggrb=1168.7MB/s, minb=1168.7MB/s,
>> maxb=1168.7MB/s, mint=3505msec, maxt=3505msec
>>
>> These drives are pretty fresh and my writes are a whole gig less than
>> my read. Its not for lack of bandwidth either.
>
> Can you please show your command line used, so I can try a similar test
> and see a comparison?
>

Seeing the 8MB/s write in your other post, you've definitely got
something bound up somewhere. You're going to want to get as low as
possible in your stack to verify the SSDs themselves are not the
issue. If you have extra unpartitioned space, try testing against that;
you can just manually TRIM it afterwards to clean up. If you can get
decent IO against a single device, you've got something wrong higher
up. You'll have to find a way to test against each layer to find your
issue.

The tool I used is FIO: http://freecode.com/projects/fio

The benchmark was pretty simple. Save the following to a file, adjust
directory to where the test should write, and run it with ./fio
[file]. You may need to reduce the size parameter if your system will
have trouble creating a 4g file. The filename option can also be a
device if the directory option is removed. Just be very careful, as
the write test is destructive if pointed at a device instead of a
filesystem.

[global]
bs=64k
ioengine=libaio
iodepth=32
size=4g
direct=1
runtime=60
directory=/root
filename=ssd.test.file

[seq-read]
rw=read
stonewall

[seq-write]
rw=write
stonewall
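
If you also want to see the small-random worst case I was describing, a
job like this can be appended to the same file (just a sketch, and the
same warning applies: pointed at a raw device it will destroy data):

[rand-write]
rw=randwrite
bs=4k
stonewall

It inherits the [global] section, so it runs 4k random writes against
the same file/device once the sequential jobs have finished.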


--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 12:37       ` Adam Goryachev
@ 2013-02-07 17:12         ` Fredrik Lindgren
  2013-02-08  0:00           ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Fredrik Lindgren @ 2013-02-07 17:12 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Brad Campbell, linux-raid

Hello,

On 02/07/2013 01:37 PM, Adam Goryachev wrote:
> On 07/02/13 23:01, Brad Campbell wrote:
>> On 07/02/13 18:19, Adam Goryachev wrote:
>>
>>> problem. Is there some way to instruct the disk (overnight) to TRIM the
>>> extra blank space, and do whatever it needs to tidy things up? Perhaps
>>> this would help, at least first thing in the morning if it isn't enough
>>> to get through the day. Potentially I could add a 6th SSD, reduce the
>>> partition size across all of them, just so there is more blank space to
>>> get through a full day worth of writes?
>> I have 6 SSD's in a RAID10, and with 3.7.x (I forget which x - 2 or 3
>> from memory) md will pass the TRIM down to the underlying devices (at
>> least for RAID10 and from memory 1).
> Yes, I have read that the very new kernel has those patches, but I'm on
> 2.6.x at the moment, and in addition, see below why they wouldn't help
> anyway...
>
>> I have a cronjob that runs at midnight :
>> Based on the run times, and the bytes trimmed count I suspect it works.
>> All filesystems are ext4. Two of them are passed through encryption, but
>> that passes TRIM down also. I do not have the discard option on any
>> mounts (that way lies severe performance issues).
> I don't have any FS on this RAID, it is like this:
> 5 x SSD
> RAID5 (doesn't support TRIM, though I've seen some patches but I think
> they are not included in any kernel yet).
> DRBD (doubt this supports TRIM)
> LVM (don't think it supports TRIM, maybe in newer kernel)
> iSCSI (don't think it supports TRIM)
> Windows 2003 and Windows 2000 (don't think it supports TRIM)
>
> So, really, all I want to do is use TRIM on the portion of the drive
> which is not partitioned at all, and I suspect the SSD knows that
> section is available, but how do I tell the drive "please go and do a
> cleanup now, because the users are all sleeping"?
>
> BTW, I just created a small LV (15G) and ran a couple of write tests
> (well, not proper one, but at least you get some idea how bad things are).
> dd if=/dev/zero of=/dev/vg0/testlv oflag=direct bs=16k count=50k
> ^C50695+0 records in
> 50695+0 records out
> 830586880 bytes (831 MB) copied, 99.4635 s, 8.4 MB/s
>
> I killed it after waiting a while....  this is while most of the systems
> are idle, except one which is currently being backed up (lots of reads,
> small number of writes). This is indicative of IO starvation though, I
> would have expected a significantly higher write performance?
>
> While I was running the dd, I ran a iostat -x 5 in another session:
>
> See text file attached for output, as seems to want to line wrap because
> it is too wide....
>
> dm-13 is the client (windows 2003) which is currently being backed up,
> dm-14 is the testlv I'm writing to from the localhost.

From the iostat output it seems quite clear that the culprit is the
drbd2 device. The /dev/sd[b-f] devices seem to have plenty more to give,
even though they're doing some 1400 iops each (which seems a lot for the
throughput you're seeing; why are the IOs towards the physical disks so
small?).

Regarding that drbd device, is there some mirroring being done to
another machine by way of drbd? If so, with a synchronous mirror to
another machine over the network, 8.4MB/s could be quite "normal", right?
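
A quick way to check that next time (resource names will differ on your
setup):

cat /proc/drbd        # "cs:Connected" means the peer is attached and,
                      # with protocol C, every write waits for it
drbdadm cstate all    # the same connection state, per resource

If it was Connected during the dd, you were really measuring the
replication path rather than the SSDs.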

Regards,
   Fredrik Lindgren


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 13:03             ` Phil Turmel
  2013-02-07 13:08               ` Adam Goryachev
@ 2013-02-07 22:03               ` Chris Murphy
  2013-02-07 23:48                 ` Chris Murphy
  1 sibling, 1 reply; 131+ messages in thread
From: Chris Murphy @ 2013-02-07 22:03 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Adam Goryachev, Dave Cundiff, linux-raid


On Feb 7, 2013, at 6:03 AM, Phil Turmel <philip@turmel.org> wrote:

> 
> Trim is a per-sector property.  Once trimmed, it stays trimmed until you
> write to *that* sector.

I agree, although I think technically it is an attribute of the page, since that's the smallest structure on an SSD for read/write. The page sizes appear to be either 4KB or 8KB, while the SSD basically lies to the OS and says its physical sector size is 512 bytes.

Once all pages in an SSD erase block are flagged for erase, which could be done with a SATA TRIM command or (dynamic or static) wear leveling by the SSD itself, then those pages can be erased.

I think what the SSD vendors have done is this: since all requests by an OS are in 4KB blocks, whether or not those LBAs are 4K aligned (as is required for 512e AF disks), they can stuff any 4KB fs block into a 4KB or 8KB page, aligned.

> If you want to achieve something with trim, you would
> trim on your clients and hope it passes all the way down the stack
> through iSCSI to your LVs and then to MD and the SSDs.


TRIM probably does reduce the need for the firmware to do its own static wear leveling, but I don't know if it's that significant except for large deletions. If it were the case that SSD's reliably returned zeros for unassigned LBAs (i.e. LBAs previously TRIM'd), there could be some optimization for the lowest level to translate page sized writes of only zeros into TRIM commands.

But for performance purposes, I don't see that it makes much difference. Over-provisioning and dynamic wear leveling take care of the performance concern. And it seems to me the usual rules of thumb for chunk size apply; maybe being a bit more on the conservative side (tending to smaller) makes more sense, to avoid large unnecessary RMW.

On the one hand a chunk size exactly sized and aligned with an SSD erase block might seem ideal; but while it might improve the efficiency of the SSD garbage collecting those blocks, it translates into higher wear.

> You just have nothing useful on the server side for trim to do.
> (Although you should manually trim the unpartitioned space.  You only
> need to do so once.)


It's unclear to me that user over provisioning is necessary. The SSD is already over-provisioned. I can see where a mismatch in usage could be a problem, e.g. enterprise patterns while using consumer SSDs.


Chris Murphy

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 22:03               ` Chris Murphy
@ 2013-02-07 23:48                 ` Chris Murphy
  2013-02-08  0:02                   ` Chris Murphy
  2013-02-08  6:15                   ` Adam Goryachev
  0 siblings, 2 replies; 131+ messages in thread
From: Chris Murphy @ 2013-02-07 23:48 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Adam Goryachev, Dave Cundiff, linux-raid


***On Feb 7, 2013, at 3:03 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> TRIM probably does reduce the need for the firmware to do its own static wear leveling, but I don't know if it's that significant except for large deletions.

In this case it helps via dynamic over provisioning. Less over provisioning is needed if unused pages/blocks can be made available (erased).



***On Feb 7, 2013, at 6:08 AM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
> 
> Though, like I said, would adding an extra SSD to the RAID5, and
> reducing the size of all partitions by 20%, and then doing TRIM on that
> newly freed space, would that improve performance because of the extra
> free space the SSD can "work with" ?


It might help eventually, but I don't think this is the magic bullet you're looking for. Have you looked at network congestion when this problem is happening?

> Basically, on occasion, when a user copies a
> large file from disk to disk, or when a user is using Outlook
> (frequently data files are over 2G), or just general workload, the
> system will "stall", sometimes causing user level errors, which mostly
> affects Outlook.

Does this concern anyone else? In particular the user doing "disk to disk" large file copies. What is this exactly? LV to LV with iSCSI over 1gigE? Why did you reject NFS for these physical Windows boxes and their VMs to access this storage, in favour of what I assume (based on this statement) is NTFS over iSCSI?

> Each LV is then exported via iSCSI


That block device needs a file system for Windows to use it.

It also seems to me one or more of these physical servers running VMs, with only 1gigE to the storage server, need either additional pipes (LACP or bonded Ethernet) or 10gigE. I can just imagine one person doing a large file copy disk to disk, which is a single pipe doing a pull/push with double NTFS packet overhead, while all other activities get hit with immense network latency as a result.


###On Feb 7, 2013, at 4:07 AM, Dave Cundiff <syshackmin@gmail.com> wrote:

> See page 17 for a block diagram of your motherboard…
> Your SSDs
> alone could saturate that if you performed a local operation. Get your
> NICs going at 4Gig and all of a sudden you'll really want that
> SATA card in slot 4 or 5.

Yeah, I think he needs all the network performance and reduced latency he can get. I'll be surprised if the SSD tuning alone makes much of a dent in this.


Chris Murphy

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 17:12         ` Fredrik Lindgren
@ 2013-02-08  0:00           ` Adam Goryachev
  0 siblings, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08  0:00 UTC (permalink / raw)
  To: Fredrik Lindgren; +Cc: Brad Campbell, linux-raid

Fredrik Lindgren <fli@wirebound.net> wrote:

>Hello,
>
>On 02/07/2013 01:37 PM, Adam Goryachev wrote:
>> On 07/02/13 23:01, Brad Campbell wrote:
>>> On 07/02/13 18:19, Adam Goryachev wrote:
>> BTW, I just created a small LV (15G) and ran a couple of write tests
>> (well, not proper one, but at least you get some idea how bad things
>are).
>> dd if=/dev/zero of=/dev/vg0/testlv oflag=direct bs=16k count=50k
>> ^C50695+0 records in
>> 50695+0 records out
>> 830586880 bytes (831 MB) copied, 99.4635 s, 8.4 MB/s
>>
> From the iostat output it seems quite clear that the culprit is the
>drbd2 device. The /dev/sd[b-f] seems to have plenty more to give,
>even though they're doing some 1400 iops each (which seems a
>lot for the throughput you're seeing, why are the IOs towards the
>physical disks so small?).
>
>Regarding that drbd device, Is there some mirroring being done to
>another machine by way of drbd? If so, with a sync-mirror to another
>machine over the network 8,4Mb/s could be quite "normal", right?

Thank you for the smack. I realise now that I was stupidly doing this late at night and, as you suggested, the secondary comes online after business hours. I will need to redo my test after manually disconnecting the secondary server after hours.

Please ignore the above data, I will re-test tonight and advise.
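
For the record, the plan for tonight's re-test is roughly this (r0 is a
placeholder for whatever the resource is actually called in drbd.conf):

drbdadm disconnect r0   # stop replicating to the secondary
# ... run the dd/fio tests against the LV ...
drbdadm connect r0      # reconnect; drbd resyncs the changed blocks

That should show what the local RAID5 + LVM stack can do without the
network replication in the path.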

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 23:48                 ` Chris Murphy
@ 2013-02-08  0:02                   ` Chris Murphy
  2013-02-08  6:25                     ` Adam Goryachev
  2013-02-08  6:15                   ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Chris Murphy @ 2013-02-08  0:02 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Adam Goryachev, Dave Cundiff, linux-raid


On Feb 7, 2013, at 4:48 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> ***On Feb 7, 2013, at 6:08 AM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
> 
>> Basically, on occasion, when a user copies a
>> large file from disk to disk,

> Why did you reject NFS …

A sign of brain deficiency, responding to myself…

Maybe this is an array to array transfer (?) and not LV to LV transfer where both LVs are on this SSD array? If it's the latter, geez, use NFS. This transaction is reduced to a rename (i.e. a move). Super fast. 

Even if it's the former, with different block devices on the same server, I'm pretty sure NFSv4 can reduce this to a cp between devices, rather than piping the data all the way to the client, just to have the client pipe it back to the server. Although I'm sure someone will correct me if I'm wrong.


Chris Murphy

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 11:07     ` Dave Cundiff
  2013-02-07 12:49       ` Adam Goryachev
@ 2013-02-08  3:32       ` Stan Hoeppner
  2013-02-08  7:11         ` Adam Goryachev
  2013-02-08  7:17         ` Adam Goryachev
  1 sibling, 2 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-08  3:32 UTC (permalink / raw)
  To: Dave Cundiff; +Cc: Adam Goryachev, linux-raid

On 2/7/2013 5:07 AM, Dave Cundiff wrote:

> Its not going to help your remote access any. From your configuration
> it looks like you are limited to 4 gigabits. At least as long as your
> NICs are not in the slot shared with the disks. If they are you might
> get some contention.
> 
> http://download.intel.com/support/motherboards/server/sb/g13326004_s1200bt_tps_r2_0.pdf
> 
> See page 17 for a block diagram of your motherboard. You have a 4x DMI
> connection that PCI slot 3, your disks, and every other onboard device
> share. That should be about 1.2GB(10Gigabits) of bandwidth. 

This is not an issue.  The C204 to LGA1155 connection is 4 lane DMI 2.0,
not 1.0, so that's 40Gb/s and 5GB/s duplex, 2.5GB/s each way, which is
more than sufficient for his devices.

> Your SSDs
> alone could saturate that if you performed a local operation. 

See above.  However, using an LSI  9211-8i, or better yet a 9207-8i, in
SLOT6 would be more optimal:

1.  These boards' ASICs are capable of 320K and 700K IOPS respectively.
 As good as it may be, the Intel C204 Southbridge SATA IO processor is
simply not in this league.  Whether it is a bottleneck in this case is
unknown at this time, but it's a possibility, as the C204 wasn't
designed with SSDs in mind.

2.  SLOT6 is PCIe x8 with 8GB/s bandwidth, 4GB/s each way, which can
handle the full bandwidth of 8 of these Intel 480GB SSDs.

> Get your
> NICs going at 4Gig and all of a sudden you'll really want that
> SATA card in slot 4 or 5.

Which brings me to the issue of the W2K DC that seems to be at the root
of the performance problems.  Adam mentioned one scenario, where a user
was copying a 50GB file from "one drive to another" through the Windows
DC.  That's a big load on any network, and would tie up both bonded GbE
links for quite a while.  All of these Windows machines are VM guests
whose local disks are apparently iSCSI targets on the server holding the
SSD md/RAID5 array.  This suggests a few possible causes:

1.  Ethernet interface saturation on Xen host under this W2K file server
2.  Ethernet bonding isn't configured properly and all iSCSI traffic
    for this W2K DC is over a single GbE link, limiting throughput to
    less than 100MB/s.
3.  All traffic, user and iSCSI, traversing a single link.
4.  A deficiency in the iSCSI configuration yielding significantly less
    than 100MB/s throughput.
5.  A deficiency in IO traffic between the W2K guest and the Xen host.
6.  Any number of kernel tuning issues on the W2K DC guest causing
    network and/or iSCSI IO issues, memory allocation problems, pagefile
    problems, etc.
7.  A problem with the 16 port GbE switch, bonding or other.  It would
    be very worthwhile to gather metrics from the switch for the ports
    connected to the Xen host with the W2K DC, and the storage server.
    This could prove to be enlightening.

-- 
Stan





^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 23:48                 ` Chris Murphy
  2013-02-08  0:02                   ` Chris Murphy
@ 2013-02-08  6:15                   ` Adam Goryachev
  1 sibling, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08  6:15 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Turmel, Dave Cundiff, linux-raid

On 08/02/13 10:48, Chris Murphy wrote:
> 
> ***On Feb 7, 2013, at 6:08 AM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>> Basically, on occasion, when a user copies a large file from disk
>> to disk, or when a user is using Outlook (frequently data files are
>> over 2G), or just general workload, the system will "stall",
>> sometimes causing user level errors, which mostly affects Outlook.
> 
> Does this concern anyone else? In particular the user doing "disk to
> disk" large file copies. What is this exactly? LV to LV with iSCSI
> over 1gigE? Why did you reject NFS for these physical Windows boxes
> and their VMs to access this storage, rather than what I assume is
> NTFS over iSCSI, because of this statement?

This isn't a common thing (well, it happens once a week when a user logs
in after hours to do some sort of backup/DB maintenance), but it is the
easiest way to reproduce the problem, and from the evidence, it seems to
match.

ie, generally the problem is characterized as:
1) Large amounts of read and write on one iSCSI device
2) Users complain about write failures, slow response, etc, even when 1
and 2 are happening on different VMs (which are on different physical machines).

>> Each LV is then exported via iSCSI
> That block device needs a file system for Windows to use it.
> 
> It also seems to me one or more of these physical servers running
> VMs, with only 1gigE to the storage server, need either additional
> pipes LACP or bonded ethernet, or 10gigE. I can just imagine one
> person doing a large file copy disk to disk, which is a single pipe
> doing a pull push, double NTFS packet overhead, while all other
> activities get immensely hit with network latency as a result.

However, this should only cause issues for users on the server which is
doing this. ie, if a user logs into terminal server 1, and copies a
large file from the desktop to another folder on the same c:, then this
terminal server will get busy, possibly using a full 1Gbps through the
VM, physical machine, switch, to the storage server. However, the
storage server has another 3Gbps to serve all the other systems. Also,
100MB/s is not an unreasonable performance level for a single system
(ok, minus overhead, even 60MB/s would probably equal what they had
before with 10 year old SCSI disks).

> ###On Feb 7, 2013, at 4:07 AM, Dave Cundiff <syshackmin@gmail.com>
> wrote:
> 
>> See page 17 for a block diagram of your motherboard… Your SSDs 
>> alone could saturate that if you performed a local operation. Get
>> your NICs going at 4Gig and all of a sudden you'll really want
>> that SATA card in slot 4 or 5.
> 
> Yeah I think it needs all the network performance and reduced latency
> as he can get. I'll be surprised if the SSD tuning alone makes much
> of a dent with this.

I still need to go in (tomorrow night) and pull apart the machine
physically to confirm which slot the network cards are in, but based on
the other comments, I don't think this is the limiting factor.... Slap
me if it is and I'll drive in tonight and check it sooner.

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  0:02                   ` Chris Murphy
@ 2013-02-08  6:25                     ` Adam Goryachev
  2013-02-08  7:35                       ` Chris Murphy
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08  6:25 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Turmel, Dave Cundiff, linux-raid

On 08/02/13 11:02, Chris Murphy wrote:
> 
> On Feb 7, 2013, at 4:48 PM, Chris Murphy <lists@colorremedies.com>
> wrote:
>> 
>> ***On Feb 7, 2013, at 6:08 AM, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>> 
>>> Basically, on occasion, when a user copies a large file from disk
>>> to disk,
> 
>> Why did you reject NFS …
> 
> A sign of brain deficiency, responding to myself…
> 
> Maybe this is an array to array transfer (?) and not LV to LV
> transfer where both LVs are on this SSD array? If it's the latter,
> geez, use NFS. This transaction is reduced to a rename (i.e. a move).
> Super fast.
> 
> Even if it's the former, with different block devices on the same
> server, I'm pretty sure NFSv4 can reduce this to a cp between
> devices, rather than piping the data all the way to the client, just
> to have the client pipe it back to the server. Although I'm sure
> someone will correct me if I'm wrong.

OK, so again, currently I have:
RAID5
DRBD
LVM2
iSCSI

On the remote machine...
iSCSI connects to server and presents block device /dev/sdX
Xen which passes through the block device to domU (Windows)
disk partition
partition is formatted NTFS


Now, alternatives to the above involving NFS I imagine are:
RAID5
DRBD
ext4 (or whatever filesystem format) to create a large file
NFS (export the large files to the Xen physical machine)

On the remote machine....
NFS mount
loop to present the NFS file as a block device
Xen which passes through the block device to domU (Windows)
disk partition
partition is formatted NTFS


LVM offers some advantages:
snapshots (disabled to try to improve performance, but in theory...)
easy to increase allocated space to a single VM
Keeps the entire system as simple block devices. There is only one
filesystem (NTFS); everything else is just block devices translated
from each layer down to the next.

I'm not sure, but it was my understanding that using block devices was
the most efficient way to do this....

Potentially, if the block devices are mis-aligned in some way, I assume
this could result in some performance loss, but again I'm not sure it
should be this dramatic.
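
If alignment does turn out to matter, I can at least check it at each
layer with something like this (sdb and the figures are just examples):

# Partition start sectors (ideally a multiple of the SSD erase block,
# or at the very least of 8 sectors / 4KiB):
parted /dev/sdb unit s print
# Where LVM starts laying out data within each PV:
pvs -o +pe_start

though I suspect the numbers will turn out to be sane.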

Any comments?

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  3:32       ` RAID performance Stan Hoeppner
@ 2013-02-08  7:11         ` Adam Goryachev
  2013-02-08 17:10           ` Stan Hoeppner
  2013-02-08  7:17         ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08  7:11 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 08/02/13 14:32, Stan Hoeppner wrote:
> On 2/7/2013 5:07 AM, Dave Cundiff wrote:
> 
>> Its not going to help your remote access any. From your configuration
>> it looks like you are limited to 4 gigabits. At least as long as your
>> NICs are not in the slot shared with the disks. If they are you might
>> get some contention.
>>
>> http://download.intel.com/support/motherboards/server/sb/g13326004_s1200bt_tps_r2_0.pdf
>>
>> See page 17 for a block diagram of your motherboard. You have a 4x DMI
>> connection that PCI slot 3, your disks, and every other onboard device
>> share. That should be about 1.2GB(10Gigabits) of bandwidth. 
> 
> This is not an issue.  The C204 to LGA1155 connection is 4 lane DMI 2.0,
> not 1.0, so that's 40Gb/s and 5GB/s duplex, 2.5GB/s each way, which is
> more than sufficient for his devices.

Thanks, good to know, though I will still check that the network cards
are in the right slots (ie, in slot 4 and 5)

>> Your SSDs
>> alone could saturate that if you performed a local operation. 
> 
> See above.  However, using an LSI  9211-8i, or better yet a 9207-8i, in
> SLOT6 would be more optimal:
> 
> 1.  These board's ASICs are capable of 320K and 700K IOPS respectively.
>  As good as it may be, the Intel C204 Southbridge SATA IO processor is
> simply not in this league.  Whether it is a bottleneck in this case is
> unknown at this time, but it's a possibility, as the C204 wasn't
> designed with SSDs in mind.
> 
> 2.  SLOT6 is PCIe x8 with 8GB/s bandwidth, 4GB/s each way, which can
> handle the full bandwidth of 8 of these Intel 480GB SSDs.

OK, so potentially I may need to get a new controller board.
Is there a test I can run which will determine the capability of the
chipset? I can shut down all the VMs tonight and run the required tests...

>> Get your
>> NICs going at 4Gig and all of a sudden you'll really want that
>> SATA card in slot 4 or 5.
> 
> Which brings me to the issue of the W2K DC that seems to be at the root
> of the performance problems.  Adam mentioned one scenario, where a user
> was copying a 50GB file from "one drive to another" through the Windows
> DC.  That's a big load on any network, and would tie up both bonded GbE
> links for quite a while.  All of these Windows machines are VM guests
> whose local disks are apparently iSCSI targets on the server holding the
> SSD md/RAID5 array.  This suggests a few possible causes:
> 
> 1.  Ethernet interface saturation on Xen host under this W2K file server
> 2.  Ethernet bonding isn't configured properly and all iSCSI traffic
>     for this W2K DC is over a single GbE link, limiting throughput to
>     less than 100MB/s.

From the switch stats, ports 5 to 8 are the bonded ports on the storage
server (iSCSI traffic):

Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
5    734007958  0         110         120729310  0         0
6    733085348  0         114         54059704   0         0
7    734264296  0         113         45917956   0         0
8    732964685  0         102         95655835   0         0

So, traffic seems reasonably well balanced across all four links, though
these stats were reset 16.5 days ago, so I'm not sure if they have
wrapped. The PacketsTX numbers look a little funny, with 5 and 8 getting
double compared to 6 and 7, but they are all certainly in use.
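
I can also cross-check how the bond is spreading the iSCSI flows from
the Linux side with something like this (bond0 is a guess at the
interface name on the storage server):

cat /proc/net/bonding/bond0   # bonding mode, transmit hash policy and
                              # per-slave traffic counters
ip -s link show               # per-interface TX/RX byte counters

which should line up with what the switch is reporting.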

> 3.  All traffic, user and iSCSI, traversing a single link.

This is true for the VMs, but "all traffic" is mostly iSCSI; user
traffic is just RDP, which is minimal.

> 4.  A deficiency in the iSCSI configuration yielding significantly less
>     than 100MB/s throughput.

Possible, but in the past, admittedly crude performance testing with a
single physical machine, all VMs stopped, produced read performance of
100 to 110MB/s (using dd with the DIRECT option). I also tested two in
parallel and did see some reduction, but I think both would get around
90MB/s... I can do more of this testing tonight. (This testing did
reveal one issue where some machines were only getting 70MB/s, but this
was due to being connected to a second gigabit switch using a single
gigabit port for the uplink. Now, all physical machines and all 4 of the
iSCSI ports are on the same switch.)

> 5.  A deficiency in IO traffic between the W2K guest and the Xen host.

I can try and do some basic testing here.... The xen hosts have a single
Intel SSD drive, but I think disk space is very limited. I might be able
to copy a small winXP onto a local disk to test performance.

> 6.  Any number of kernel tuning issues on the W2K DC guest causing
>     network and/or iSCSI IO issues, memory allocation problems, pagefile
>     problems, etc.

Most of the terminal servers and application servers were using the
pagefile during peak load (win2003 is limited to 4G RAM), so I have
allocated 4GB of RAM as a block device and assigned it to Windows, which
then places a 4G pagefile on it. This removed the pagefile load from the
network and storage system, but had minimal noticeable impact.

> 7.  A problem with the 16 port GbE switch, bonding or other.  It would
>     be very worthwhile to gather metrics from the switch for the ports
>     connected to the Xen host with the W2K DC, and the storage server.
>     This could prove to be enlightening.

The win2k DC is on physical machine 1, which is on port 9 of the switch;
I've included the above stats here as well:

Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
5    734007958  0         110         120729310  0         0
6    733085348  0         114         54059704   0         0
7    734264296  0         113         45917956   0         0
8    732964685  0         102         95655835   0         0
9    1808508983 0         72998       1942345594 0         0

I can also see very detailed stats per port; I'll show the port 9
detailed stats here and comment below. If you think any other port would
be useful, please let me know.

Interface				g9
MST ID					CST
ifIndex					9
Port Type
Port Channel ID				Disable
Port Role				Disabled
STP Mode
STP State				Manual forwarding
Admin Mode				Enable
LACP Mode				Enable
Physical Mode				Auto
Physical Status				1000 Mbps Full Duplex
Link Status				Link Up
Link Trap				Enable
Packets RX and TX 64 Octets		49459211
Packets RX and TX 65-127 Octets		1618637216
Packets RX and TX 128-255 Octets	226809713
Packets RX and TX 256-511 Octets	26365450
Packets RX and TX 512-1023 Octets	246692277
Packets RX and TX 1024-1518 Octets	1587427388
Packets RX and TX > 1522 Octets		0
Octets Received				625082658823
Packets Received 64 Octets		15738586
Packets Received 65-127 Octets		1232246454
Packets Received 128-255 Octets		104644153
Packets Received 256-511 Octets		9450877
Packets Received 512-1023 Octets	208875645
Packets Received 1024-1518 Octets	239934983
Packets Received > 1522 Octets		0
Total Packets Received Without Errors	1810890698
Unicast Packets Received		1810793833
Multicast Packets Received		23697
Broadcast Packets Received		73168
Total Packets Received with MAC Errors	0
Jabbers Received			0
Fragments Received			0
Undersize Received			0
Alignment Errors			0
Rx FCS Errors				0
Overruns				0
Total Received Packets Not Forwarded	0
Local Traffic Frames			0
802.3x Pause Frames Received		0
Unacceptable Frame Type			0
Multicast Tree Viable Discards		0
Reserved Address Discards		0
Broadcast Storm Recovery		0
CFI Discards				0
Upstream Threshold			0
Total Packets Transmitted (Octets)	2070575251257
Packets Transmitted 64 Octets		33720625
Packets Transmitted 65-127 Octets	386390762
Packets Transmitted 128-255 Octets	122165560
Packets Transmitted 256-511 Octets	16914573
Packets Transmitted 512-1023 Octets	37816632
Packets Transmitted 1024-1518 Octets	1347492405
Packets Transmitted > 1522 Octets	0
Maximum Frame Size			1518
Total Packets Transmitted Successfully	1944500557
Unicast Packets Transmitted		1940616380
Multicast Packets Transmitted		2164121
Broadcast Packets Transmitted		1720056
Total Transmit Errors			0
Tx FCS Errors				0
Underrun Errors				0
Total Transmit Packets Discarded	0
Single Collision Frames			0
Multiple Collision Frames		0
Excessive Collision Frames		0
Port Membership Discards		0
STP BPDUs Received			0
STP BPDUs Transmitted			0
RSTP BPDUs Received			0
RSTP BPDUs Transmitted			0
MSTP BPDUs Received			0
MSTP BPDUs Transmitted			0
802.3x Pause Frames Transmitted		1230476
EAPOL Frames Received			0
EAPOL Frames Transmitted		0
Time Since Counters Last Cleared	16 day 16 hr 11 min 52 sec

To me, there are two interesting bits of information here.

1) We can see a breakdown of packet sizes, and this shows no jumbo
frames at all. I'm not really sure whether jumbo frames would help, or
how to go about configuring them, though I guess it only needs to be
done on the Linux storage server, the Linux physical machines, and the
switch.

2) The value for Pause Frames Transmitted: I'm not sure what this is,
but it doesn't sound like a good thing....
http://en.wikipedia.org/wiki/Ethernet_flow_control
seems to indicate that the switch is telling the physical machine to
slow down sending data, and if these happen at even time intervals, that
is an average of one per second for the past 16 days.....

Looking at port 5 (one of the ports connected to the storage server),
this value is much higher (approx 24 per second averaged over 16 days).

I can understand that the storage server can send faster than any
individual receiver, so I can see why the switch might tell it to slow
down, but I don't see why the switch would tell the physical machine to
slow down.

So, to summarise, I think I need to look into the network performance
and find out what is going on there. However, before I get too deep into
that, I'd like to confirm that things are working properly on the local
machine at the RAID5, DRBD and LVM layers. So I got some fio tests last
night to run, which I'll do after hours tonight, and then post those
stats. If that shows:
1) RAID5 performance is excellent, then I should be able to avoid
purchase of an extra controller card, and mark the SATA chipset OK
2) DRBD performance is excellent, then I can ignore config errors there
3) LVM performance is excellent, then I can ignore config errors there

That then leaves me with iSCSI issues, network issues, etc, but, like I
said, one thing at a time.

Are there any other or related tests you think I should be running on
the local machine to ensure things are working properly? Any other
suggestions or information I need to provide? Should I set up and start
graphing some of these values from the switch? I'm sure it supports
SNMP, so I could poll the values and dump them into some RRD files for
analysis.
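
Something as crude as this would probably do for a first pass (the
community string, switch address and ifIndex are placeholders for my
setup):

# Poll the octet counters for switch port 5 once a minute:
while sleep 60; do
    date
    snmpget -v2c -c public 192.168.1.2 \
        IF-MIB::ifInOctets.5 IF-MIB::ifOutOctets.5
done >> port5.log

and the pause frame counters could be added too, if the switch exposes
them over SNMP.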

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  3:32       ` RAID performance Stan Hoeppner
  2013-02-08  7:11         ` Adam Goryachev
@ 2013-02-08  7:17         ` Adam Goryachev
  1 sibling, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08  7:17 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 08/02/13 14:32, Stan Hoeppner wrote:

> 6.  Any number of kernel tuning issues on the W2K DC guest causing
>     network and/or iSCSI IO issues, memory allocation problems, pagefile
>     problems, etc.

I forgot to mention something else under consideration here... I recall
an older windows 2003 VM which was running service pack 2 being very
slow under xen, and the solution was to upgrade to service pack 4 (the
xen mailing list advised there was something windows did pre-sp4 that
made xen very inefficient). I don't recall if this applied to win2000,
but I will probably upgrade the win2k DC to windows 2003 service pack 4
over the coming weekend (depending on approval/etc). AFAIK, this should
be a fairly painless upgrade, and is definitely something I want to get
done anyway.

I'd still like to focus on running some tests to ensure the system is
working properly, and/or find the issues in the other parts and resolve
them.

Thank you to everyone who has offered any suggestions or advice so far.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07  8:11 ` Stan Hoeppner
  2013-02-07 10:05   ` Adam Goryachev
@ 2013-02-08  7:21   ` Adam Goryachev
  2013-02-08  7:37     ` Chris Murphy
  2013-02-08 13:04     ` Stan Hoeppner
  1 sibling, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08  7:21 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On 07/02/13 19:11, Stan Hoeppner wrote:
> On 2/7/2013 12:48 AM, Adam Goryachev wrote:
> Switching to noop may help a little, as may disabling NCQ, i.e. putting
> the driver in native IDE mode, or setting queue depth to 1.
> 

I changed these two settings last night (noop and nr_requests = 1), and
today that seemed to produce more complaints and more errors logged about
write failures, so I have restored nr_requests to 128 and set the
scheduler back to deadline.

Regards,
Adam


-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  6:25                     ` Adam Goryachev
@ 2013-02-08  7:35                       ` Chris Murphy
  2013-02-08  8:34                         ` Chris Murphy
  2013-02-08 14:19                         ` Adam Goryachev
  0 siblings, 2 replies; 131+ messages in thread
From: Chris Murphy @ 2013-02-08  7:35 UTC (permalink / raw)
  To: Adam Goryachev
  Cc: Phil Turmel, Dave Cundiff, stan@hardwarefreak.com Hoeppner,
	linux-raid@vger.kernel.org list


On Feb 7, 2013, at 11:25 PM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
> 
> 
> On the remote machine....
> NFS mount
> loop to present the NFS file as a block device
> Xen which passes through the block device to domU (Windows)
> disk partition
> partition is formatted NTFS

Assuming the domU gets its own IP, Windows will mount NFS directly. You don't need to format it. On the storage server, storage is ext4 or XFS and can be on LVM if you wish.

> I'm not sure, but it was my understanding that using block devices was
> the most efficient way to do this….

Depends on the usage. Files being copied and/or moved on the same storage array sounds like a file sharing context to me, not a block device requirement. And user report of write failures over iSCSI bothers me also. NFS is going to be much more fault tolerant, and all of your domUs can share one pile of storage. But as you have it configured, you've effectively over provisioned if each domU gets its own LV, all the more reason I don't think you need to do more over provisioning. And for now I think NFS vs iSCSI can wait another day, and that your problem lies elsewhere on the network.

Do you have internet network traffic going through this same switch? Or do you have the storage network isolated such that *only* iSCSI traffic is happening on a given wire?

> ie, if a user logs into terminal server 1, and copies a
> large file from the desktop to another folder on the same c:, then this
> terminal server will get busy, possibly using a full 1Gbps through the
> VM, physical machine, switch, to the storage server. However, the
> storage server has another 3Gbps to serve all the other systems.

I think you need to replicate the condition that causes the problem, on the storage server itself first, to isolate this from being a network problem. And I'd do rather intensive read tests first and then do predominantly write tests to see if there's a distinct difference (above what's expected for the RAID 5 write hit). And then you can repeat these from your domUs.

I'm flummoxed off hand if an NTFS formatted iSCSI block device behaves exactly as an NTFS formatted LV; ergo, is it possible (and OK) to unmount the volumes on the domUs, and then mount the LV as NTFS on the storage server so that your storage server can run local tests, simultaneously to those LVs. Obviously you should not mount the LVs on the storage server while they are mounted over iSCSI or you'll totally corrupt the file system (and it will let you do this, a hazard of iSCSI).


Chris Murphy

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  7:21   ` RAID performance Adam Goryachev
@ 2013-02-08  7:37     ` Chris Murphy
  2013-02-08 13:04     ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Chris Murphy @ 2013-02-08  7:37 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: stan, linux-raid


On Feb 8, 2013, at 12:21 AM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:

> I changed these two settings last night (noop and nr_request = 1) and
> today seemed to produce more complaints, and more errors logged about
> write failures, so I have restored nr_requests to 128 and the scheduler
> back to deadline.

Specifically what errors are being logged? And if this is on clients, what are you getting in dmesg on the storage server at the time of these errors? There might be a lot.


Chris Murphy

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  7:35                       ` Chris Murphy
@ 2013-02-08  8:34                         ` Chris Murphy
  2013-02-08 14:31                           ` Adam Goryachev
  2013-02-08 14:19                         ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Chris Murphy @ 2013-02-08  8:34 UTC (permalink / raw)
  To: Adam Goryachev
  Cc: stan@hardwarefreak.com Hoeppner, Phil Turmel,
	linux-raid@vger.kernel.org list


On Feb 8, 2013, at 12:35 AM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> I'm flummoxed off hand if an NTFS formatted iSCSI block device behaves exactly as an NTFS formatted LV; 

The answer depends on whether you formatted the iSCSI target directly or partitioned it, then formatted the partition. If the latter, use kpartx -av to make the partition available under /dev/mapper and then you mount the device kpartx returns. Anyway the point was to run benchmarks on these; but if you have space in the VG you can make new (smaller) LVs for this, it's slightly safer.

I'd benchmark one LV as a reference. Then two simultaneously. Then three. I think the more basic the test for now the better or there's just too much data. Then I'd do that same test on physical servers, over iSCSI (one, two, then three simultaneous targets). Then do that same test within a VM, over iSCSI.

Trust Stan and Phil on this more than me. I just think something appallingly obvious will materialize with simple tests, and then you can dig down deeper with more specific tests.

First though, sort out this write error business. That bugs me more than the benchmarking stuff. Client and server side write errors in an iSCSI context make me think of file system corruption, either the write errors being caused by corruption, or causing it.


Chris Murphy

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  7:21   ` RAID performance Adam Goryachev
  2013-02-08  7:37     ` Chris Murphy
@ 2013-02-08 13:04     ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-08 13:04 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On 2/8/2013 1:21 AM, Adam Goryachev wrote:
> On 07/02/13 19:11, Stan Hoeppner wrote:
>> On 2/7/2013 12:48 AM, Adam Goryachev wrote:
>> Switching to noop may help a little, as may disablig NCQ, i.e. putting
>> the driver in native IDE mode, or setting queue depth to 1.
>>
> 
> I changed these two settings last night (noop and nr_request = 1) and
> today seemed to produce more complaints, and more errors logged about
> write failures, so I have restored nr_requests to 128 and the scheduler
> back to deadline.

/sys/block/sda/queue/nr_requests has nothing to do with the SATA queue
depth.  nr_requests controls the queue size of the scheduler (elevator).
 Decreasing that to a value of 1 will obviously have dire consequences,
dramatically decreasing SSD throughput.

I was referring to the NCQ queue depth on the C204 SATA controller.
This may or may not be manually configurable with that hardware/driver;
it may simply be autonegotiated between the chip and the SSDs.  Which is
why I mentioned switching to native IDE mode, which disables NCQ
entirely.  However, disabling NCQ, if it helps, is a very minor
performance optimization, and won't have a significant impact on your
problem.
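
If you do want to inspect or clamp the NCQ depth the kernel negotiated,
it is exposed per device, e.g. (sda is just an example):

cat /sys/block/sda/device/queue_depth          # typically 31 with NCQ
echo 1 > /sys/block/sda/device/queue_depth     # depth 1 ~ NCQ disabled

But as above, don't expect it to make much of a dent in your problem.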

Regardless, after reading your previous email, which I'll respond to
next, it seems pretty clear your overarching problem is a network
architecture oversight/flaw.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07 15:32         ` Dave Cundiff
@ 2013-02-08 13:58           ` Adam Goryachev
  2013-02-08 21:42             ` Stan Hoeppner
  2013-02-17  9:52             ` RAID performance - new kernel results Adam Goryachev
  0 siblings, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08 13:58 UTC (permalink / raw)
  To: Dave Cundiff; +Cc: linux-raid

On 08/02/13 02:32, Dave Cundiff wrote:
> On Thu, Feb 7, 2013 at 7:49 AM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>>>
>>> I definitely see that. See below for a FIO run I just did on one of my RAID10s

OK, some fio results.

Firstly, this is done against /tmp which is on the single standalone
Intel SSD used for the rootfs (shows some performance level of the
chipset I presume):

root@san1:/tmp/testing# fio /root/test.fio
seq-read: (g=0): rw=read, bs=64K-64K/64K-64K, ioengine=libaio, iodepth=32
seq-write: (g=1): rw=write, bs=64K-64K/64K-64K, ioengine=libaio, iodepth=32
Starting 2 processes
seq-read: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [_W] [100.0% done] [0K/137M /s] [0/2133 iops] [eta 00m:00s]
seq-read: (groupid=0, jobs=1): err= 0: pid=4932
  read : io=4096MB, bw=518840KB/s, iops=8106, runt=  8084msec
seq-write: (groupid=1, jobs=1): err= 0: pid=5138
  write: io=4096MB, bw=136405KB/s, iops=2131, runt= 30749msec
Run status group 0 (all jobs):
   READ: io=4096MB, aggrb=518840KB/s, minb=531292KB/s, maxb=531292KB/s,
mint=8084msec, maxt=8084msec

Run status group 1 (all jobs):
  WRITE: io=4096MB, aggrb=136404KB/s, minb=139678KB/s, maxb=139678KB/s,
mint=30749msec, maxt=30749msec

Disk stats (read/write):
  sda: ios=66570/66363, merge=10297/10453, ticks=259152/993304,
in_queue=1252592, util=99.34%


PS, I'm assuming I should omit the extra output similar to what you
did.... If I should include all info, I can re-run and provide...

This seems to indicate a read speed of 531M and a write speed of 139M,
which to me says something is wrong. I expected writes to be slower, but
not that much slower?
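
(For anyone wanting to reproduce this, a job file along these lines
gives equivalent output -- the directory and direct lines are the only
parts not visible in the output above:)

[global]
bs=64k
ioengine=libaio
iodepth=32
size=4g
direct=1
directory=/tmp/testing

[seq-read]
rw=read
stonewall

[seq-write]
rw=write
stonewall

The stonewall lines are what split the run into the two reporting groups
shown above.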

Moving on, I've stopped the secondary DRBD, created a new LV (testlv) of
15G, and formatted with ext4, mounted it, and re-run the test:

seq-read: (groupid=0, jobs=1): err= 0: pid=19578
  read : io=4096MB, bw=640743KB/s, iops=10011, runt=  6546msec
seq-write: (groupid=1, jobs=1): err= 0: pid=19997
  write: io=4096MB, bw=208765KB/s, iops=3261, runt= 20091msec
Run status group 0 (all jobs):
   READ: io=4096MB, aggrb=640743KB/s, minb=656120KB/s, maxb=656120KB/s,
mint=6546msec, maxt=6546msec

Run status group 1 (all jobs):
  WRITE: io=4096MB, aggrb=208765KB/s, minb=213775KB/s, maxb=213775KB/s,
mint=20091msec, maxt=20091msec

Disk stats (read/write):
  dm-14: ios=65536/64841, merge=0/0, ticks=206920/469464,
in_queue=676580, util=98.89%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0,
aggrin_queue=0, aggrutil=0.00%
    drbd2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=-nan%

dm-14 is the testlv

So, this indicates a max read speed of 656M and write of 213M; again,
write is very slow (about 30% of the read speed).

With these figures, just 2 x 1Gbps links would saturate the write
performance of this RAID5 array.

Finally, changing the fio config file to point filename=/dev/vg0/testlv
(ie, raw LV, no filesystem):
seq-read: (groupid=0, jobs=1): err= 0: pid=10986
  read : io=4096MB, bw=652607KB/s, iops=10196, runt=  6427msec
seq-write: (groupid=1, jobs=1): err= 0: pid=11177
  write: io=4096MB, bw=202252KB/s, iops=3160, runt= 20738msec
Run status group 0 (all jobs):
   READ: io=4096MB, aggrb=652606KB/s, minb=668269KB/s, maxb=668269KB/s,
mint=6427msec, maxt=6427msec

Run status group 1 (all jobs):
  WRITE: io=4096MB, aggrb=202252KB/s, minb=207106KB/s, maxb=207106KB/s,
mint=20738msec, maxt=20738msec

Not much difference, which I didn't really expect...

So, should I be concerned about these results? Do I need to try to
re-run these tests at a lower layer (ie, remove DRBD and/or LVM from the
picture)? Are these meaningless and I should be running a different
test/set of tests/etc ?
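
(I'm guessing the non-destructive way to go below DRBD/LVM would be
read-only runs straight against the md device and one of the member
disks, something like:

fio --name=md-read --filename=/dev/md1 --readonly --rw=read --bs=64k \
    --ioengine=libaio --iodepth=32 --direct=1 --size=4g
fio --name=disk-read --filename=/dev/sdb --readonly --rw=read --bs=64k \
    --ioengine=libaio --iodepth=32 --direct=1 --size=4g

though obviously I can't do the equivalent write test against the raw
devices without destroying data.)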

Thanks,
Adam





-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  7:35                       ` Chris Murphy
  2013-02-08  8:34                         ` Chris Murphy
@ 2013-02-08 14:19                         ` Adam Goryachev
  1 sibling, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08 14:19 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Phil Turmel, Dave Cundiff, stan@hardwarefreak.com Hoeppner,
	linux-raid@vger.kernel.org list

On 08/02/13 18:35, Chris Murphy wrote:
> 
> On Feb 7, 2013, at 11:25 PM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>> 
>> 
>> On the remote machine.... NFS mount loop to present the NFS file as
>> a block device Xen which passes through the block device to domU
>> (Windows) disk partition partition is formatted NTFS
> 
> Assuming the domU gets it's own IP, Windows will mount NFS directly.
> You don't need to format it. On the storage server, storage is ext4
> or XFS and can be on LVM if you wish.

Are you suggesting that MS Windows 2003 Server (without any commercial
add-on software) will boot from NFS and run normally (no user noticeable
changes) with its C: actually being a bunch of files on an NFS server?

I must admit, if that is possible, I'll be... better educated. I don't
think it is, hence I've gone with iSCSI, which allows me to present a
block device to Windows. I had considered configuring Windows to
actually boot from iSCSI, which I think is mostly possible, but apart
from the added complexity, I've also heard it ends up with worse
performance, as the emulated network card is less efficient than the
emulated disk + native network card. (Also the host gets a larger CPU
allocation than the Windows VM.)

>> I'm not sure, but it was my understanding that using block devices
>> was the most efficient way to do this….
> 
> Depends on the usage. Files being copied and/or moved on the same
> storage array sounds like a file sharing context to me, not a block
> device requirement. And user report of write failures over iSCSI
> bothers me also. NFS is going to be much more fault tolerant, and all
> of your domUs can share one pile of storage. But as you have it
> configured, you've effectively over provisioned if each domU gets its
> own LV, all the more reason I don't think you need to do more over
> provisioning. And for now I think NFS vs iSCSI can wait another day,
> and that your problem lies elsewhere on the network.
> 
> Do you have internet network traffic going through this same switch?
> Or do you have the storage network isolated such that *only* iSCSI
> traffic is happening on a given wire?

There isn't any actual "internet traffic", as that all comes into a
Linux firewall with IP forwarding disabled (and no NAT); only a squid
proxy and SMTP are available to forward traffic out. In any case, yes,
there is a single 1G ethernet port in each physical box which carries
all the SAN traffic as well as the user level traffic.

>> ie, if a user logs into terminal server 1, and copies a large file
>> from the desktop to another folder on the same c:, then this 
>> terminal server will get busy, possibly using a full 1Gbps through
>> the VM, physical machine, switch, to the storage server. However,
>> the storage server has another 3Gbps to serve all the other
>> systems.
> 
> I think you need to replicate the condition that causes the problem,
> on the storage server itself first, to isolate this from being a
> network problem. And I'd do rather intensive read tests first and
> then do predominately write tests to see if there's a distinct
> difference (above what's expected for the RAID 5 write hit). And then
> you can repeat these from your domUs.

OK, well, I've started running some performance tests on the storage
server, I'd like to find out if they are "expected results", and then
will move on to test over the network.

> I'm flummoxed off hand if an NTFS formatted iSCSI block device
> behaves exactly as an NTFS formatted LV; ergo, is it possible (and
> OK) to unmount the volumes on the domUs, and then mount the LV as
> NTFS on the storage server so that your storage server can run local
> tests, simultaneously to those LVs. Obviously you should not mount
> the LVs on the storage server while they are mounted over iSCSI or
> you'll totally corrupt the file system (and it will let you do this,
> a hazard of iSCSI).

Yes, I've no problem mounting an LV directly on the storage server, I've
done that before for testing/migration of physical machines. Of course,
as you mentioned, not while the VM is actually running!
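
For the record, roughly what I do is (the LV name is only an example):

kpartx -av /dev/vg0/somelv      # maps the NTFS partition inside the LV
mount -t ntfs-3g /dev/mapper/<name kpartx printed> /mnt
# ... inspect or benchmark ...
umount /mnt
kpartx -dv /dev/vg0/somelv      # remove the mappings again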

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  8:34                         ` Chris Murphy
@ 2013-02-08 14:31                           ` Adam Goryachev
  0 siblings, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08 14:31 UTC (permalink / raw)
  To: Chris Murphy
  Cc: stan@hardwarefreak.com Hoeppner, Phil Turmel,
	linux-raid@vger.kernel.org list

On 08/02/13 19:34, Chris Murphy wrote:
> 
> On Feb 8, 2013, at 12:35 AM, Chris Murphy <lists@colorremedies.com>
> wrote:
>> 
>> I'm flummoxed off hand if an NTFS formatted iSCSI block device
>> behaves exactly as an NTFS formatted LV;
> 
> The answer depends on whether you formatted the iSCSI target directly
> or partitioned it, then formatted the partition. If the latter, use
> kpartx -av to make the partition available under /dev/mapper and then
> you mount the device kpartx returns. Anyway the point was to run
> benchmarks on these; but if you have space in the VG you can make new
> (smaller) LVs for this, it's slightly safer.
> 
> I'd benchmark one LV as a reference. Then two simultaneously. Then
> three. I think the more basic the test for now the better or there's
> just too much data. Then I'd do that same test on physical servers,
> over iSCSI (one, two, then three simultaneous targets). Then do that
> same test within a VM, over iSCSI.
> 
> Trust Stan and Phil on this more than me. I just think something
> appallingly obvious will materialize with simple tests, and then you
> can dig down deeper with more specific tests.
> 
> First though, sort out this write error business. That bugs me more
> than the benchmarking stuff. Client and server side write errors in
> an iSCSI context make me think of file system corruption, either the
> write error caused by file system corruption, or creating it.

The write errors are pop-ups on windows saying "Delayed Write Failure -
some part of the file blahblah could not be written, this may be due to
network or system errors, etc..."

This message is also recorded in the "Event Viewer", and is also logged
to the "console" as individual message boxes which have to be dismissed
one by one by clicking OK.

Sorry, I've shut down everything now so I don't have the full message,
but in effect, Windows fails to write the data, and that write is thrown
away. Generally the application will terminate, and re-starting it is
OK. Often, Excel will end up with a corrupted file, and the file needs
to be recovered from the backup system (at least we have one). Also,
Outlook will have problems recovering, and MYOB will frequently fail and
crash (MYOB is an Australian commercial accounting application).

IMHO, these are significant issues. I'd be happy to have slow
performance, as long as no data was lost or corrupted (well, happier :)).


Since restarting the physical box that is running the Windows 2000 DC
last night (less than 24 hours ago), there have been zero error messages
in the dmesg output. Just the startup and shutdown messages to create
and destroy the virtual network interface when the VM was booted and
shut down.

On the storage server, there are zero messages in the kern.log file
between 6:57am and 8:57pm (which is when the DRBD connection to the
second server is stopped and started). Same applies to the messages and
syslog files....

I might have a network related issue, but I really want to ensure that
RAID/disk performance is right/sorted before I move on. Else I'll start
looking at lots of things and never properly tick anything off as OK.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08  7:11         ` Adam Goryachev
@ 2013-02-08 17:10           ` Stan Hoeppner
  2013-02-08 18:44             ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-08 17:10 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

The information you've provided below seems to indicate the root cause
of the problem.  The good news is that the fix(es) are simple, and
inexpensive.

I must say, now that I understand the problem, I'm wondering why you
used 4 bonded GbE ports on your iSCSI target server, yet employed a
single GbE port on the only machine that accesses it, according to the
information you've presented.  Based on that, this is the source of your
problem.  Keep reading.

On 2/8/2013 1:11 AM, Adam Goryachev wrote:

> OK, so potentially, I may need to get a new controller board.
> Is there a test I can run which will determine the capability of the
> chipset? I can shutdown all the VM's tonight, and run the required tests...

Forget all of this.  The problem isn't with the storage server, but your
network architecture.

> From the switch stats, ports 5 to 8 are the bonded ports on the storage
> server (iSCSI traffic):
> 
> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
> 5    734007958  0         110         120729310  0         0
> 6    733085348  0         114         54059704   0         0
> 7    734264296  0         113         45917956   0         0
> 8    732964685  0         102         95655835   0         0

I'm glad I asked you for this information.  This clearly shows that the
server is performing LACP round robin fanning nearly perfectly.  It also
shows that the bulk of the traffic coming from the W2K DC, which
apparently hosts the Windows shares for TS users, is being pumped to the
storage server over port 5, the first port in the switch's bonding
group.  The switch is doing adaptive load balancing with transmission
instead of round robin.  This is the default behavior of many switches
and is fine.

> So, traffic seems reasonably well balanced across all four links

The storage server's transmit traffic is well balanced out of the NICs,
but the receive traffic from the switch is imbalanced, almost 3:1
between ports 5 and 7.  This is due to the switch doing ALB, and helps
us diagnose the problem.

> The win2k DC is on physical machine 1 which is on port 9 of the switch,
> I've included the above stats here as well:
> 
> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
> 5    734007958  0         110         120729310  0         0
> 6    733085348  0         114         54059704   0         0
> 7    734264296  0         113         45917956   0         0
> 8    732964685  0         102         95655835   0         0

> 9    1808508983 0         72998       1942345594 0         0

And here the problem is brightly revealed.  This W2K DC box on port 9
hosting the shares for the terminal services users appears to be
funneling all of your file IO to/from the storage server via iSCSI, and
to/from the terminal servers via CIFS-- all over a single GbE interface.
 Normally this wouldn't be a big problem.  But you have users copying
50GB files over the network, to terminal server machines no less.

As seen from the switch metrics, when a user does a large file copy from
a share on one iSCSI target to a share on another iSCSI target, here is
what is happening:

1.  The W2K DC share server pulls the filesystem blocks over iSCSI
2.  The storage server pushes the packets out round robin at 4x the rate
    that the DC can accept them, saturating its receive port
3.  The switch issues back offs to the server NICs during the entire
    length of the copy operation due to the 4:1 imbalance.  The server
    is so over powered with SSD and 4x GbE links this doesn't bog it
    down, but it does give us valuable information as to the problem
4.  The DC upon receiving the filesystem blocks immediately transmits
    them back to the other iSCSI target on the storage server
5.  Now the DC's transmit interface is saturated
6.  So now both Tx/Rx ports on the DC NIC are saturated
7.  Now all CIFS traffic on all terminal servers is significantly
    delayed due to congestion at the DC, causing severe lag for others
    doing file operations to/from the DC shares.
8.  If the TS/roaming profiles are on a share on this DC server
    any operation touching a profile will be slow, especially
    logon/off, as your users surely have massive profiles, given
    they save multi GB files to their desktops

> 802.3x Pause Frames Transmitted		1230476

"Bingo" metric.

> 2) The value for Pause Frames Transmitted, I'm not sure what this is,
> but it doesn't sound like a good thing....
> http://en.wikipedia.org/wiki/Ethernet_flow_control
> Seems to indicate that the switch is telling the physical machine to
> slow down sending data, and if these happen at even time intervals, then
> that is an average of one per second for the past 16 days.....

The average is irrelevant.  The switch only sends pauses to the storage
server NICs when they're transmitting more frames/sec than the single
port to which the DC is attached can forward them.  More precisely,
pauses are issued every time the buffer on switch port 9 is full when
ports 5-8 attempt to forward a frame.  The buffer will be full because
the downstream GbE NIC can't swallow the frames fast enough.  You've got
1.2 million of these pause frames logged.  This is your beacon in the
dark, shining bright light on the problem.
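
You can also watch this from the storage server side; roughly (interface
names are examples, and the exact counter names vary by driver):

ethtool -a eth0                            # is RX/TX pause enabled on the NIC
ethtool -S eth0 | grep -i -E 'pause|flow'  # pause/flow control counters, if exposed

If those counters climb during a big file copy, you're seeing the same
thing the switch counters are reporting.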

> I can understand that the storage server can send faster that any
> individual receiver, so I can see why the switch might tell it to slow
> down, but I don't see why the switch would tell the physical machine to
> slow down.

It's not telling the "physical machine" to "slow down".  It's telling
the ethernet device to pause between transmissions to the target MAC
address which is connected to the switch port that is under load
distress.  Your storage server isn't slowing down your terminal servers
or the users apps running on them.  Your DC is.

> So, to summarise, I think I need to look into the network performance,

You just did, and helped put the final nail in the coffin.  You simply
didn't realize it.  And you may balk at the solution, as it is so
simple, and cheap.  The problem, and the solution are:

Problem:
W2K DC handles all the client CIFS file IO traffic with the terminal
servers, as well as all iSCSI IO to/from the storage server, over a
single GbE interface.  It has a 4:1 ethernet bandwidth deficit with the
storage server alone, causing massive network congestion at the DC
machine during large file transfers.  This in turn bogs down CIFS
traffic across all TS boxen, lagging the users.

Solution:
Simply replace the onboard single port GbE NIC in the W2K DC share
server with an Intel quad port GbE NIC, and configure LACP bonding with
the switch. Use ALB instead of RR.  Using ALB will prevent the DC share
server from overwhelming the terminal servers in the same manner the
storage server is currently doing the DC.  Leave the storage server as RR.

However, this doesn't solve the problem of one user on a terminal server
bogging down everyone else on the same TS box if s/he pulls a 50GB file
to his/her desktop.  But the degradation will now be limited to only
users on that one TS box.  If you want to mitigate this to a degree, use
two bonded NIC ports in the TS boxen.  Here you can use RR transmit
without problems, as 2 ports can't saturate the 4 on the DC's new 4 port
NIC.  A 50GB transfer will take 4-5 minutes instead of the current 8-10.
 But my $deity, why are people moving 50GB files across a small biz
network for Pete's sake...  If this is an ongoing activity, you need to
look into Windows user level IO limiting so you can prevent one person
from hogging all the IO bandwidth.  I've never run into this before so
you'll have to research it.  May be a policy for it if you're lucky.
I've always handled this kinda thing with a cluestick.  On to the
solution, or at least most of it.

http://www.intel.com/content/dam/doc/product-brief/ethernet-i340-server-adapter-brief.pdf

You want the I340-T4, 4 port copper, obviously.  Runs about $250 USD,
about $50 less than the I350-T4.  It's the best 4 port copper GbE NIC
for the money with all the features you need.  You're already using 2x
I350-T2s in the server so this card will be familiar WRT driver
configuration, etc.  It's $50 cheaper than the I350-T4 but with all the
needed features.

Crap, I just remembered you're using consumer Asus boards for the other
machines.  I just checked the manual for the Asus M5A88-M and it's not
clear if anything but a graphics card can be used in the x16 slot...

So, I'd acquire one 4 port PCIe x4 Intel card, and two of these Intel
single port x1 cards (Intel doesn't offer a 2 port x1 card):
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/pro-1000-pt-server-adapter-brief.pdf

If the 4 port x4 card won't work, use the two single port x1 cards with
LACP ALB.  In which case you'll also want to switch the NICs on the
iSCSI server to ALB, or you'll still have switch congestion.  The 4 port
400MB/s solution would be optimal, but 200MB/s is still double what you
have now, and will help alleviate the problem, but won't eliminate it.
I hope the 4 port PCIe x4 card will work in that board.

If you must use the PCIe x1 single port cards, you could try adding a
PRO 1000 PCI NIC, and Frankenstein these 3 together with the onboard
Realtek 8111 to get 4 ports.  That's uncharted territory for me.  I
always use matching NICs, or at least all from the same hardware family
using the same driver.

I hope I've provided helpful information.

Keep us posted.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08 17:10           ` Stan Hoeppner
@ 2013-02-08 18:44             ` Adam Goryachev
  2013-02-09  4:09               ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-08 18:44 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 09/02/13 04:10, Stan Hoeppner wrote:
> The information you've provided below seems to indicate the root cause
> of the problem.  The good news is that the fix(es) are simple, and
> inexpensive.
> 
> I must say, now that I understand the problem, I'm wondering why you
> used 4 bonded GbE ports on your iSCSI target server, yet employed a
> single GbE port on the only machine that accesses it, according to the
> information you've presented.  Based on that, this is the source of your
> problem.  Keep reading.

Well, because the old SAN device had 4 x Gbps ports, and I copied that,
and I also didn't want an individual PC to flood the SAN... I guess I
never worked out that one PC was really driving 70% of the traffic....

>> From the switch stats, ports 5 to 8 are the bonded ports on the storage
>> server (iSCSI traffic):
>>
>> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
>> 5    734007958  0         110         120729310  0         0
>> 6    733085348  0         114         54059704   0         0
>> 7    734264296  0         113         45917956   0         0
>> 8    732964685  0         102         95655835   0         0
> 
> I'm glad I asked you for this information.  This clearly shows that the
> server is performing LACP round robin fanning nearly perfectly.  It also
> shows that the bulk of the traffic coming from the W2K DC, which
> apparently hosts the Windows shares for TS users, is being pumped to the
> storage server over port 5, the first port in the switch's bonding
> group.  The switch is doing adaptive load balancing with transmission
> instead of round robin.  This is the default behavior of many switches
> and is fine.

Is there some method to fix this on the switch? I have configured the
switch so that those 4 ports are a single LAG, which I assumed meant the
switch would be smart enough to load balance properly... Guess I never
checked that side of it though...

>> So, traffic seems reasonably well balanced across all four links
> 
> The storage server's transmit traffic is well balanced out of the NICs,
> but the receive traffic from the switch is imbalanced, almost 3:1
> between ports 5 and 7.  This is due to the switch doing ALB, and helps
> us diagnose the problem.

The switch doesn't seem to have any setting to configure ALB or RR, or
at least I don't know what I'm looking for.... In any case, I suppose if
both sides of the network have equivalent bandwidth, then it should be
OK....

>> The win2k DC is on physical machine 1 which is on port 9 of the switch,
>> I've included the above stats here as well:
>>
>> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
>> 5    734007958  0         110         120729310  0         0
>> 6    733085348  0         114         54059704   0         0
>> 7    734264296  0         113         45917956   0         0
>> 8    732964685  0         102         95655835   0         0
> 
>> 9    1808508983 0         72998       1942345594 0         0
> 
> And here the problem is brightly revealed.  This W2K DC box on port 9
> hosting the shares for the terminal services users appears to be
> funneling all of your file IO to/from the storage server via iSCSI, and
> to/from the terminal servers via CIFS-- all over a single GbE interface.
>  Normally this wouldn't be a big problem.  But you have users copying
> 50GB files over the network, to terminal server machines no less.
> 
> As seen from the switch metrics, when a user does a large file copy from
> a share on one iSCSI target to a share on another iSCSI target, here is
> what is happening:
> 
> 1.  The W2K DC share server pulls the filesystem blocks over iSCSI
> 2.  The storage server pushes the packets out round robin at 4x the rate
>     that the DC can accept them, saturating its receive port
> 3.  The switch issues back offs to the server NICs during the entire
>     length of the copy operation due to the 4:1 imbalance.  The server
>     is so over powered with SSD and 4x GbE links this doesn't bog it
>     down, but it does give us valuable information as to the problem
> 4.  The DC upon receiving the filesystem blocks immediately transmits
>     them back to the other iSCSI target on the storage server

Another possible use case would send them off over SMB to the terminal
server, and potentially that terminal server would send it back to the
storage server.

> 5.  Now the DC's transmit interface is saturated
> 6.  So now both Tx/Rx ports on the DC NIC are saturated
> 7.  Now all CIFS traffic on all terminal servers is significantly
>     delayed due to congestion at the DC, causing severe lag for others
>     doing file operations to/from the DC shares.
> 8.  If the TS/roaming profiles are on a share on this DC server
>     any operation touching a profile will be slow, especially
>     logon/off, as your users surely have massive profiles, given
>     they save multi GB files to their desktops

OK, makes sense ...

>> 802.3x Pause Frames Transmitted		1230476
> "Bingo" metric.
> 
>> 2) The value for Pause Frames Transmitted, I'm not sure what this is,
>> but it doesn't sound like a good thing....
>> http://en.wikipedia.org/wiki/Ethernet_flow_control
>> Seems to indicate that the switch is telling the physical machine to
>> slow down sending data, and if these happen at even time intervals, then
>> that is an average of one per second for the past 16 days.....
> 
> The average is irrelevant.  The switch only sends pauses to the storage
> server NICs when they're transmitting more frames/sec than the single
> port to which the DC is attached can forward them.  More precisely,
> pauses are issued every time the buffer on switch port 9 is full when
> ports 5-8 attempt to forward a frame.  The buffer will be full because
> the downstream GbE NIC can't swallow the frames fast enough.  You've got
> 1.2 million of these pause frames logged.  This is your beacon in the
> dark, shining bright light on the problem.
> 
>> I can understand that the storage server can send faster that any
>> individual receiver, so I can see why the switch might tell it to slow
>> down, but I don't see why the switch would tell the physical machine to
>> slow down.
> 
> It's not telling the "physical machine" to "slow down".  It's telling
> the ethernet device to pause between transmissions to the target MAC
> address which is connected to the switch port that is under load
> distress.  Your storage server isn't slowing down your terminal servers
> or the users apps running on them.  Your DC is.
> 
>> So, to summarise, I think I need to look into the network performance,
> 
> You just did, and helped put the final nail in the coffin.  You simply
> didn't realize it.  And you may balk at the solution, as it is so
> simple, and cheap.  The problem, and the solution are:
> 
> Problem:
> W2K DC handles all the client CIFS file IO traffic with the terminal
> servers, as well as all iSCSI IO to/from the storage server, over a
> single GbE interface.  It has a 4:1 ethernet bandwidth deficit with the
> storage server alone, causing massive network congestion at the DC
> machine during large file transfers.  This in turn bogs down CIFS
> traffic across all TS boxen, lagging the users.
> 
> Solution:
> Simply replace the onboard single port GbE NIC in the W2K DC share
> server with an Intel quad port GbE NIC, and configure LACP bonding with
> the switch. Use ALB instead of RR.  Using ALB will prevent the DC share
> server from overwhelming the terminal servers in the same manner the
> storage server is currently doing the DC.  Leave the storage server as RR.
> 
> However, this doesn't solve the problem of one user on a terminal server
> bogging down everyone else on the same TS box if s/he pulls a 50GB file
> to his/her desktop.  But the degradation will now be limited to only
> users on that one TS box.  If you want to mitigate this to a degree, use
> two bonded NIC ports in the TS boxen.  Here you can use RR transmit
> without problems, as 2 ports can't saturate the 4 on the DC's new 4 port
> NIC.  A 50GB transfer will take 4-5 minutes instead of the current 8-10.
>  But my $deity, why are people moving 50GB files across a small biz
> network for Pete's sake...  If this is an ongoing activity, you need to
> look into Windows user level IO limiting so you can prevent one person
> from hogging all the IO bandwidth.  I've never run into this before so
> you'll have to research it.  May be a policy for it if you're lucky.
> I've always handled this kinda thing with a cluestick.  On to the
> solution, or at least most of it.
> 
> http://www.intel.com/content/dam/doc/product-brief/ethernet-i340-server-adapter-brief.pdf
> 
> You want the I340-T4, 4 port copper, obviously.  Runs about $250 USD,
> about $50 less than the I350-T4.  It's the best 4 port copper GbE NIC
> for the money with all the features you need.  You're already using 2x
> I350-T2s in the server so this card will be familiar WRT driver
> configuration, etc.  It's $50 cheaper than the I350-T4 but with all the
> needed features.
> 
> Crap, I just remembered you're using consumer Asus boards for the other
> machines.  I just checked the manual for the Asus M5A88-M and it's not
> clear if anything but a graphics card can be used in the x16 slot...
> 
> So, I'd acquire one 4 port PCIe x4 Intel card, and two of these Intel
> single port x1 cards (Intel doesn't offer a 2 port x1 card):
> http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/pro-1000-pt-server-adapter-brief.pdf
> 
> If the 4 port x4 card won't work, use the two single port x1 cards with
> LACP ALB.  In which case you'll also want to switch the NICs on the
> iSCSI server to ALB, or you'll still have switch congestion.  The 4 port
> 400MB/s solution would be optimal, but 200MB/s is still double what you
> have now, and will help alleviate the problem, but won't eliminate it.
> I hope the 4 port PCIe x4 card will work in that board.
> 
> If you must use the PCIe x1 single port cards, you could try adding a
> PRO 1000 PCI NIC, and Frankenstein these 3 together with the onboard
> Realtek 8111 to get 4 ports.  That's uncharted territory for me.  I
> always use matching NICs, or at least all from the same hardware family
> using the same driver.

Since I'm about to commit significant surgery on the network
infrastructure, I might as well get this right. I did always have the
desire to separate the iSCSI network from the SMB/user traffic network
anyway.

BTW, would I probably see improved stability (ie, reduced performance,
but fewer errors) by reducing the number of ethernet ports on the
storage server to 2? Not a permanent solution, but potentially a very
short term improvement while waiting for parts....

If I added the 4 port card to the DC machine, and a dual port card to
each of the other machines, that means I have:
4 ports on SAN1
4 ports on SAN2
4 ports on DC
2 ports on each other box (7)

Total of 26 ports

I would then need to get a new switch; a 24 port switch is not enough,
and 48 ports seems overkill. It would be nice to have a spare port for
"management access" as well. Also, I guess the switch needs to support a
very busy network...

Move the iSCSI network to a new IP range, and dedicate these network
interfaces for iSCSI.

I could then use the existing onboard 1Gbps ethernet on the machines for
the user level connectivity/SMB/RDP/etc, on the existing switch/etc.
Also, I can use the existing onboard 1G ports on the storage server for
talking to the user level network/management/etc.
That would free up 8 ports on the existing switch (removing the 2 x 4
ports on SAN1/2).

This would also allow up to 1Gbps SMB data transfers between the
machines, although I suppose a single TS can consume 100% of the DC
bandwidth, but I think this is not unusual, and should work OK if
another TS wants to do some small transfer at the same time.

So, purchase list becomes:
1 x 4port ethernet card $450 each
7 x 2port ethernet card $161 each
1 x 48 port switch (any suggestions?) $600

Total Cost: $2177

> I hope I've provided helpful information.

Definitely...

Just in case the budget dollars don't stretch that far, would it be a
reasonable budget option to do this:
Add 1 x 2port ethernet card to the DC machine
Add 7 x 1port ethernet card to the rest of the machines $32 (Intel Pro
1000GT DT Adapter I 82541PI Low-Profile PCI)
Add 1 x 24port switch $300

Total Cost: $685

I'm assuming this would stop sharing SMB/iSCSI on the same ports, and
improve the ability for the TS machines to at least talk to the DC and
know the IO is "in progress" and hence reduce the data loss/failures?

> Keep us posted.

Will do, I'll have to price up the above options, and get approval for
purchase, and then will take a few days to get it all in place/etc...
Thank you very much for all the very useful assistance.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08 13:58           ` Adam Goryachev
@ 2013-02-08 21:42             ` Stan Hoeppner
  2013-02-14 22:42               ` Chris Murphy
  2013-02-17  9:52             ` RAID performance - new kernel results Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-08 21:42 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/8/2013 7:58 AM, Adam Goryachev wrote:

> Firstly, this is done against /tmp which is on the single standalone
> Intel SSD used for the rootfs (shows some performance level of the
> chipset I presume):

The chipset performance shouldn't be an issue, but it's possible.

> root@san1:/tmp/testing# fio /root/test.fio
> seq-read: (g=0): rw=read, bs=64K-64K/64K-64K, ioengine=libaio, iodepth=32
> seq-write: (g=1): rw=write, bs=64K-64K/64K-64K, ioengine=libaio, iodepth=32
> Starting 2 processes
> seq-read: Laying out IO file(s) (1 file(s) / 4096MB)
> Jobs: 1 (f=1): [_W] [100.0% done] [0K/137M /s] [0/2133 iops] [eta 00m:00s]
> seq-read: (groupid=0, jobs=1): err= 0: pid=4932
>   read : io=4096MB, bw=518840KB/s, iops=8106, runt=  8084msec
> seq-write: (groupid=1, jobs=1): err= 0: pid=5138
>   write: io=4096MB, bw=136405KB/s, iops=2131, runt= 30749msec
> Run status group 0 (all jobs):
>    READ: io=4096MB, aggrb=518840KB/s, minb=531292KB/s, maxb=531292KB/s,
> mint=8084msec, maxt=8084msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4096MB, aggrb=136404KB/s, minb=139678KB/s, maxb=139678KB/s,
> mint=30749msec, maxt=30749msec
> 
> Disk stats (read/write):
>   sda: ios=66570/66363, merge=10297/10453, ticks=259152/993304,
> in_queue=1252592, util=99.34%
...
> This seems to indicate a read speed of 531M and write of 139M, which to
> me says something is wrong. I thought write speed is slower, but not
> that much slower?

Study this:
http://www.anandtech.com/show/5508/intel-ssd-520-review-cherryville-brings-reliability-to-sandforce/3

That's the 240GB version of your 520S.  Note the write tests are all
well over 300MB/s, one seq write test reaching almost 400MB/s.  The
480GB version should be even better.  These tests use 4KB *aligned* IOs.
 If you've partitioned the SSDs, and your partition boundaries fall in
the middle of erase blocks instead of perfectly between them, then your
IOs will be unaligned, and performance will suffer.  Considering the
numbers you're seeing with FIO this may be part of the low performance
problem.
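
Checking this is quick; something like (substitute your devices):

parted /dev/sdb unit s print    # look at the Start sector of sdb1
fdisk -l -u /dev/sdb            # same info, start sector in 512B units

If the start sector is a multiple of 2048 (a 1MiB boundary) you're
almost certainly fine; the old fdisk default of 63 is the classic
misalignment.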

> Moving on, I've stopped the secondary DRBD, created a new LV (testlv) of
> 15G, and formatted with ext4, mounted it, and re-run the test:
> 
> seq-read: (groupid=0, jobs=1): err= 0: pid=19578
>   read : io=4096MB, bw=640743KB/s, iops=10011, runt=  6546msec
> seq-write: (groupid=1, jobs=1): err= 0: pid=19997
>   write: io=4096MB, bw=208765KB/s, iops=3261, runt= 20091msec
> Run status group 0 (all jobs):
>    READ: io=4096MB, aggrb=640743KB/s, minb=656120KB/s, maxb=656120KB/s,
> mint=6546msec, maxt=6546msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4096MB, aggrb=208765KB/s, minb=213775KB/s, maxb=213775KB/s,
> mint=20091msec, maxt=20091msec
> 
> Disk stats (read/write):
>   dm-14: ios=65536/64841, merge=0/0, ticks=206920/469464,
> in_queue=676580, util=98.89%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0,
> aggrin_queue=0, aggrutil=0.00%
>     drbd2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=-nan%
> 
> dm-14 is the testlv
> 
> So, this indicates a max read speed of 656M and write of 213M, again,
> write is very slow (about 30%).
> 
> With these figures, just 2 x 1Gbps links would saturate the write
> performance of this RAID5 array.

You might get close if you ran a synthetic test, but you wouldn't
bottleneck at the array with CIFS traffic from that DC.  Now, once you
get the network problems straightened out, then you may bottleneck the
SSDs with multiple large sequential writes.  Assuming you don't get the
block IO issues fixed.

I recommend blowing away the partitions entirely and building your md
array on bare drives.  As part of this you will have to recreate your
LVs you're exporting via iSCSI.  Make sure all LVs are aligned to the
underlying md device geometry.  This will eliminate any possible
alignment issues.
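
A rough sketch of what I mean, once the data is safely elsewhere (device
names follow your current layout, the numbers assume you keep the 64k
chunk, and this is not a cut-and-paste recipe):

mdadm --create /dev/md1 --level=5 --chunk=64 --raid-devices=5 /dev/sd[bcdef]
# full stripe width = 64k chunk * 4 data members = 256k
pvcreate --dataalignment 256k /dev/md1
vgcreate vg0 /dev/md1
lvcreate -L 15G -n testlv vg0

With the PV data area aligned to a full stripe, every LV you carve out
of the VG starts on a stripe boundary, so the iSCSI-exported LVs inherit
the alignment.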

Whether it does or not, given what I've learned of this environment, I'd
go ahead and install one of the LSI 9207-8i 6GB/s SAS/SATA HBAs I
mentioned earlier, in SLOT6 for full bandwidth, and move all the SSDs in
the array over to it.  This will give you 600MB/s peak bandwidth per SSD
eliminating any possible issues created by running them at SATA2 link
speed, and eliminate any possible issues with the C204 Southbridge chip,
while giving you substantially higher controller IOPS: 700,000.  If the
SSDs are not on a chassis backplane you'll need two SFF-8087 forward
breakout cables to connect the drives to the card.  The "kit" version of
these cards comes with these cables.  Kit runs ~$350 USD.

> Finally, changing the fio config file to point filename=/dev/vg0/testlv
> (ie, raw LV, no filesystem):
> seq-read: (groupid=0, jobs=1): err= 0: pid=10986
>   read : io=4096MB, bw=652607KB/s, iops=10196, runt=  6427msec
> seq-write: (groupid=1, jobs=1): err= 0: pid=11177
>   write: io=4096MB, bw=202252KB/s, iops=3160, runt= 20738msec
> Run status group 0 (all jobs):
>    READ: io=4096MB, aggrb=652606KB/s, minb=668269KB/s, maxb=668269KB/s,
> mint=6427msec, maxt=6427msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4096MB, aggrb=202252KB/s, minb=207106KB/s, maxb=207106KB/s,
> mint=20738msec, maxt=20738msec
> 
> Not much difference, which I didn't really expect...
> 
> So, should I be concerned about these results? Do I need to try to
> re-run these tests at a lower layer (ie, remove DRBD and/or LVM from the
> picture)? Are these meaningless and I should be running a different
> test/set of tests/etc ?

The ~200MB/s seq writes is a bit alarming, as is the ~500MB/s read rate.
 5 SSDs in RAID5 should be able to do much much more, especially reads.
 Theoretically you should be able to squeeze 2GB/s of read speed out of
this RAID5.  But given this is a RAID5 array, write will always be
slower, even with SSD.  But they shouldn't be this much slower with SSD
because the RMW latency is so much lower.  But with large sequential
writes you shouldn't have RMW cycles anyway.  If DRBD is mirroring the
md/RAID5 device it will definitely skew your test results lower, but not
drastically so.  I can't recall if you stated the size of your md stripe
cache.  If it's too small that may be hurting performance.

Something we've only briefly touched on so far is the single write
thread bottleneck of the md/RAID5 driver.  To verify if this is part of
this problem you need to capture CPU core utilization during your write
tests to see if md is eating all of one core.  If it is then your RAID5
speed will never get better on this mobo/CPU combo, until you upgrade to
a kernel with the appropriate patches.  But at only 200MB/s I doubt this
is the case, but check it anyway.  Once you get the IO problem fixed you
may run into the single thread problem, so you'll check this again at
that time.
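
Checking both is straightforward; something like (the 4096 value is only
a starting point to experiment with):

cat /sys/block/md1/md/stripe_cache_size   # default is 256 (pages per member)
echo 4096 > /sys/block/md1/md/stripe_cache_size
# re-run the write test and watch the RAID5 kernel thread:
top -H -p "$(pgrep md1_raid5)"

If md1_raid5 sits at ~100% of one core during the write test, you've hit
the single thread wall.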

IIRC, people on this list are hitting ~400-500MB/s sequential writes
with RAID5/6/10 rust arrays, so I don't think the write thread is your
problem.  Not yet anyway.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08 18:44             ` Adam Goryachev
@ 2013-02-09  4:09               ` Stan Hoeppner
  2013-02-10  4:40                 ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-09  4:09 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

Normally I'd trim the post, as this is so absolutely huge, but I want to
keep this thread intact for people stumbling across it via Google.  I
think it's very informative/educational to read this troubleshooting
progression, and gain the insights and knowledge contained herein.

On 2/8/2013 12:44 PM, Adam Goryachev wrote:
> On 09/02/13 04:10, Stan Hoeppner wrote:
>> The information you've provided below seems to indicate the root cause
>> of the problem.  The good news is that the fix(es) are simple, and
>> inexpensive.
>>
>> I must say, now that I understand the problem, I'm wondering why you
>> used 4 bonded GbE ports on your iSCSI target server, yet employed a
>> single GbE port on the only machine that accesses it, according to the
>> information you've presented.  Based on that, this is the source of your
>> problem.  Keep reading.
> 
> Well, because the old SAN device had 4 x Gbps ports, and I copied that,
> and I also didn't want an individual PC to flood the SAN... I guess I
> never worked out that one PC was really driving 70% of the traffic....

And that was smart thinking.  You simply didn't realize that one TS
could now flood the CIFS server.

>>> From the switch stats, ports 5 to 8 are the bonded ports on the storage
>>> server (iSCSI traffic):
>>>
>>> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
>>> 5    734007958  0         110         120729310  0         0
>>> 6    733085348  0         114         54059704   0         0
>>> 7    734264296  0         113         45917956   0         0
>>> 8    732964685  0         102         95655835   0         0
>>
>> I'm glad I asked you for this information.  This clearly shows that the
>> server is performing LACP round robin fanning nearly perfectly.  It also
>> shows that the bulk of the traffic coming from the W2K DC, which
>> apparently hosts the Windows shares for TS users, is being pumped to the
>> storage server over port 5, the first port in the switch's bonding
>> group.  The switch is doing adaptive load balancing with transmission
>> instead of round robin.  This is the default behavior of many switches
>> and is fine.
> 
> Is there some method to fix this on the switch? I have configured the
> switch that those 4 ports are a single LAG, which I assumed meant the
> switch would be smart enough to load balance properly... Guess I never
> checked that side of it though...

After thinking this through more thoroughly, I realize your IO server
may be doing broadcast aggregation and not round robin.  However, in
either case this is bad, as it will cause out of order packets or
duplicate packets.  Both of these are wrong for your network
architecture and will cause problems.  RR will cause TCP packets to be
reassembled out of sequence, causing extra overhead at the receiver, and
possibly errors if not reassembled in the correct order.  Broadcast
will cause duplicate packets to arrive at the receiver, which must
discard them.  Both flood the receiver's switch port.

The NIC ports on the IO server need to be configured as 802.3ad Dynamic
if using the Linux bonding driver.  If you're using the Intel driver
LACP it should be set to this as well, though the name may be different.
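
With the in-kernel bonding driver that means mode 4 (802.3ad).  On a
Debian-style setup the config looks roughly like this -- interface names
and addresses are only examples:

auto bond0
iface bond0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    bond-slaves eth1 eth2 eth3 eth4
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4

cat /proc/net/bonding/bond0   # should report "IEEE 802.3ad Dynamic link aggregation"

The xmit hash policy just decides which sessions land on which link;
layer3+4 spreads multiple iSCSI connections across the links reasonably
well.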

Round robin fanning of frames across all 4 ports evenly seems like a
good idea on paper, until you dig into the 802.3ad protocol:

http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf

Once you do you realize (for me again, as it's been a while) that any
single session will be limited by default to a single physical link of
the group.  LACP only gives increased bandwidth across links when
multiple sessions are present.  This is done to preserve proper packet
ordering per session, which is corrupted when fanning packets of a
single session across all links.  Fanning (round robin) is only meant
to be used in multi-switch setups where each host has a NIC link to each
switch--i.e. Beowulf clusters.  In the default Dynamic mode, you don't
have the IO server flooding the DC with more packets than it can handle,
because the two hosts will be communicating over the same link(s), no
more, so bandwidth and packet volume is equal between them.

So, you need to disable RR or broadcast, whichever it is currently, on
the IO server, and switch it to Dynamic mode.  This will instantly kill
the flooding problem, stop the switch from sending PAUSE frames to the
IO server, and might eliminate the file/IO errors.  I'm not sure on this
last one, as I've not seen enough information about the errors (or the
actual errors themselves).  That said, disabling the Windows write
caching on the local drives backed by the iSCSI LUNs might fix this as
well.  It should never be left enabled in a configuration such as yours.

>>> So, traffic seems reasonably well balanced across all four links
>>
>> The storage server's transmit traffic is well balanced out of the NICs,
>> but the receive traffic from the switch is imbalanced, almost 3:1
>> between ports 5 and 7.  This is due to the switch doing ALB, and helps
>> us diagnose the problem.
> 
> The switch doesn't seem to have any setting to configure ALB or RR, or
> at least I don't know what I'm looking for.... In any case, I suppose if
> both sides of the network have equivalent bandwidth, then it should be
> OK....

Let's see, I think you listed the switch model...  yes, GS716T-200

It does stock 802.3ad static and dynamic link aggregation, dynamic by
default it appears, so standard session based streams.  This is what you
want.

Ah, here you go.  It does have port based ingress/egress rate limiting.
 So you should be able to slow down the terminal server hosts so no
single one can flood the DC.  Very nice.  I wouldn't have expected this
in this class of switch.

So, you can fix the network performance problem without expending any
money.  You'll just have one TS host and its users bogged down when
someone does a big file copy.  And if you can find a Windows policy to
limit IO per user, you can solve it completely.

That said, I'd still get two or 4 bonded ports into that DC share server
to speed things up for everyone.

>>> The win2k DC is on physical machine 1 which is on port 9 of the switch,
>>> I've included the above stats here as well:
>>>
>>> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
>>> 5    734007958  0         110         120729310  0         0
>>> 6    733085348  0         114         54059704   0         0
>>> 7    734264296  0         113         45917956   0         0
>>> 8    732964685  0         102         95655835   0         0
>>
>>> 9    1808508983 0         72998       1942345594 0         0
>>
>> And here the problem is brightly revealed.  This W2K DC box on port 9
>> hosting the shares for the terminal services users appears to be
>> funneling all of your file IO to/from the storage server via iSCSI, and
>> to/from the terminal servers via CIFS-- all over a single GbE interface.
>>  Normally this wouldn't be a big problem.  But you have users copying
>> 50GB files over the network, to terminal server machines no less.
>>
>> As seen from the switch metrics, when a user does a large file copy from
>> a share on one iSCSI target to a share on another iSCSI target, here is
>> what is happening:
>>
>> 1.  The W2K DC share server pulls the filesystem blocks over iSCSI
>> 2.  The storage server pushes the packets out round robin at 4x the rate
>>     that the DC can accept them, saturating its receive port
>> 3.  The switch issues back offs to the server NICs during the entire
>>     length of the copy operation due to the 4:1 imbalance.  The server
>>     is so over powered with SSD and 4x GbE links this doesn't bog it
>>     down, but it does give us valuable information as to the problem
>> 4.  The DC upon receiving the filesystem blocks immediately transmits
>>     them back to the other iSCSI target on the storage server
> 
> Another possible use case would send them off over SMB to the terminal
> server, and potentially that terminal server would send it back to the
> storage server.

Yeah, I skipped listing the CIFS client host in the traffic chain, as
once the DC is flooded all the TS servers crawl.

>> 5.  Now the DC's transmit interface is saturated
>> 6.  So now both Tx/Rx ports on the DC NIC are saturated
>> 7.  Now all CIFS traffic on all terminal servers is significantly
>>     delayed due to congestion at the DC, causing severe lag for others
>>     doing file operations to/from the DC shares.
>> 8.  If the TS/roaming profiles are on a share on this DC server
>>     any operation touching a profile will be slow, especially
>>     logon/off, as your users surely have massive profiles, given
>>     they save multi GB files to their desktops
> 
> OK, makes sense ...
> 
>>> 802.3x Pause Frames Transmitted		1230476
>> "Bingo" metric.
>>
>>> 2) The value for Pause Frames Transmitted, I'm not sure what this is,
>>> but it doesn't sound like a good thing....
>>> http://en.wikipedia.org/wiki/Ethernet_flow_control
>>> Seems to indicate that the switch is telling the physical machine to
>>> slow down sending data, and if these happen at even time intervals, then
>>> that is an average of one per second for the past 16 days.....
>>
>> The average is irrelevant.  The switch only sends pauses to the storage
>> server NICs when they're transmitting more frames/sec than the single
>> port to which the DC is attached can forward them.  More precisely,
>> pauses are issued every time the buffer on switch port 9 is full when
>> ports 5-8 attempt to forward a frame.  The buffer will be full because
>> the downstream GbE NIC can't swallow the frames fast enough.  You've got
>> 1.2 million of these pause frames logged.  This is your beacon in the
>> dark, shining bright light on the problem.
>>
>>> I can understand that the storage server can send faster that any
>>> individual receiver, so I can see why the switch might tell it to slow
>>> down, but I don't see why the switch would tell the physical machine to
>>> slow down.
>>
>> It's not telling the "physical machine" to "slow down".  It's telling
>> the ethernet device to pause between transmissions to the target MAC
>> address which is connected to the switch port that is under load
>> distress.  Your storage server isn't slowing down your terminal servers
>> or the users apps running on them.  Your DC is.
>>
>>> So, to summarise, I think I need to look into the network performance,
>>
>> You just did, and helped put the final nail in the coffin.  You simply
>> didn't realize it.  And you may balk at the solution, as it is so
>> simple, and cheap.  The problem, and the solution are:
>>
>> Problem:
>> W2K DC handles all the client CIFS file IO traffic with the terminal
>> servers, as well as all iSCSI IO to/from the storage server, over a
>> single GbE interface.  It has a 4:1 ethernet bandwidth deficit with the
>> storage server alone, causing massive network congestion at the DC
>> machine during large file transfers.  This in turn bogs down CIFS
>> traffic across all TS boxen, lagging the users.
>>
>> Solution:
>> Simply replace the onboard single port GbE NIC in the W2K DC share
>> server with an Intel quad port GbE NIC, and configure LACP bonding with
>> the switch. Use ALB instead of RR.  Using ALB will prevent the DC share
>> server from overwhelming the terminal servers in the same manner the
>> storage server is currently doing the DC.  Leave the storage server as RR.
>>
>> However, this doesn't solve the problem of one user on a terminal server
>> bogging down everyone else on the same TS box if s/he pulls a 50GB file
>> to his/her desktop.  But the degradation will now be limited to only
>> users on that one TS box.  If you want to mitigate this to a degree, use
>> two bonded NIC ports in the TS boxen.  Here you can use RR transmit
>> without problems, as 2 ports can't saturate the 4 on the DC's new 4 port
>> NIC.  A 50GB transfer will take 4-5 minutes instead of the current 8-10.
>>  But my $deity, why are people moving 50GB files across a small biz
>> network for Pete's sake...  If this is an ongoing activity, you need to
>> look into Windows user level IO limiting so you can prevent one person
>> from hogging all the IO bandwidth.  I've never run into this before so
>> you'll have to research it.  May be a policy for it if you're lucky.
>> I've always handled this kinda thing with a cluestick.  On to the
>> solution, or at least most of it.
>>
>> http://www.intel.com/content/dam/doc/product-brief/ethernet-i340-server-adapter-brief.pdf
>>
>> You want the I340-T4, 4 port copper, obviously.  Runs about $250 USD,
>> about $50 less than the I350-T4.  It's the best 4 port copper GbE NIC
>> for the money with all the features you need.  You're already using 2x
>> I350-T2s in the server so this card will be familiar WRT driver
>> configuration, etc.  It's $50 cheaper than the I350-T4 but with all the
>> needed features.
>>
>> Crap, I just remembered you're using consumer Asus boards for the other
>> machines.  I just checked the manual for the Asus M5A88-M and it's not
>> clear if anything but a graphics card can be used in the x16 slot...
>>
>> So, I'd acquire one 4 port PCIe x4 Intel card, and two of these
>> Intel single port x1 cards (Intel doesn't offer a 2 port x1 card):
>> http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/pro-1000-pt-server-adapter-brief.pdf
>>
>> If the 4 port x4 card won't work, use the two single port x1 cards with
>> LACP ALB.  In which case you'll also want to switch the NICs on the
>> iSCSI server to ALB, or you'll still have switch congestion.  The 4 port
>> 400MB/s solution would be optimal, but 200MB/s is still double what you
>> have now, and will help alleviate the problem, but won't eliminate it.
>> I hope the 4 port PCIe x4 card will work in that board.
>>
>> If you must use the PCIe x1 single port cards, you could try adding a
>> PRO 1000 PCI NIC, and Frankenstein these 3 together with the onboard
>> Realtek 8111 to get 4 ports.  That's uncharted territory for me.  I
>> always use matching NICs, or at least all from the same hardware family
>> using the same driver.
> 
> Since I'm about to commit significant surgery on the network
> infrastructure, I might as well get this right. I did always have the
> desire to separate the iSCSI network from the SMB/user traffic network
> anyway.

Not necessary, won't gain you anything, now that you know how to
configure your current gear, or at least, that it can be configured to
meet your needs, solving your current problems.

> BTW, would I probably see improved stability (ie, reduced performance,
> but less errors) by reducing the number of ethernet ports on the storage
> server to 2 ? Not a permanent solution, but potentially a very short
> term improvement while waiting for parts....

Nope, just change the bonding mode on the IO server to standard LACP
Dynamic, as I stated above, and this is all fixed.
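
In Linux bonding driver terms that's mode 802.3ad (mode 4).  Roughly, it's
nothing more than this wherever you set your bonding options -- exact file
and syntax depend on your distro, so treat it as a sketch:

  # /etc/modprobe.d/bonding.conf (or the equivalent ifupdown/initscripts config)
  options bonding mode=802.3ad miimon=100
  # switch side: the LAG for those 4 ports set to LACP (dynamic), not a static trunk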

> If I added the 4 port card to the DC machine, and a dual port card to
> each of the other machines, that means I have:
> 4 ports on SAN1
> 4 ports on SAN2
> 4 ports on DC
> 2 ports on each other box (7)
> 
> Total of 26 ports

Add the 4 port to the DC if it'll work in the x16 slot, if not use two
of the single port PCIe x1 NICs I mentioned and bond them in 802.3ad
Dynamic mode, same as with the IO server.  Look into Windows TS per user
IO rate limits.  If this capability exists, limit each user to 50MB/s.

And with that, you should have fixed all the network issues.  Combined
with the changes to the IO server, you should be all squared away.

> I then need to get a new switch, a 24 port switch is not enough, and 48
> ports seems overkill. Would be nice to have a spare port for "management
> access" as well. Also I guess the switch needs to support a very busy
> network...

Unneeded additional cost and complexity.

> Move the iSCSI network to a new IP range, and dedicate these network
> interfaces for iSCSI.

Unneeded additional cost and complexity.

> I could then use the existing onboard 1Gbps ethernet on the machines for
> the user level connectivity/SMB/RDP/etc, on the existing switch/etc.
> Also, I can use the existing onboard 1G ports on the storage server for
> talking to the user level network/management/etc.
> That would free up 8 ports on the existing switch (removing the 2 x 4
> ports on SAN1/2).

Unneeded additional cost and complexity.

> This would also allow up to 1Gbps SMB data transfers between the
> machines, although I suppose a single TS can consume 100% of the DC
> bandwidth, but I think this is not unusual, and should work OK if
> another TS wants to do some small transfer at the same time.

Already addressed.  Even with only 2 bonded ports on the DC, the most
bandwidth a single TS box can tie up is half.  And if you implement port
level rate limiting of 500Mb/s for each of the 4 TS boxen switch ports
(in/out) you can never flood the DC.

> So, purchase list becomes:
> 1 x 4port ethernet card $450 each
> 7 x 2port ethernet card $161 each
> 1 x 48 port switch (any suggestions?) $600
> 
> Total Cost: $2177

Again, you have all the network hardware you need, so this is completely
unnecessary.  You just need to get what you have configured correctly.

>> I hope I've provided helpful information.
> 
> Definitely...

Everything above should be even more helpful.  My apologies for not
having precise LACP insight in my previous post.  It's been quite a
while and I was rusty, and didn't have time to refresh my knowledge base
before the previous post.

> Just in case the budget dollars doesn't stretch that far, would it be a
> reasonable budget option to do this:
> Add 1 x 2port ethernet card to the DC machine
> Add 7 x 1port ethernet card to the rest of the machines $32 (Intel Pro
> 1000GT DT Adapter I 82541PI Low-Profile PCI)
> Add 1 x 24port switch $300
> 
> Total Cost: $685

If the DC can take a PCIe x4 dual port card, that should work fine with
the reconfiguration I described above.  The rest of the gear in that
$685 is wasted--no gain.  Use part of the remaining balance for the LSI
9207-8i HBA.  That will make a big difference in throughput once you get
alignment and other issues identified and corrected, more than double
your current bandwidth and IOPS, making full time DRBD possible.

> I'm assuming this would stop sharing SMB/iSCSI on the same ports, and
> improve the ability for the TS machines to at least talk to the DC and
> know the IO is "in progress" and hence reduce the data loss/failures?

Again this is all unnecessary once you implement the aforementioned
changes.  If the IO errors on the TS machines still occur the cause
isn't in the network setup.  Running CIFS(SMB)/iSCSI on the same port is
done 24x7 by thousands of sites.  This isn't the cause of the TS IO
errors.  Congestion alone shouldn't cause them either, unless a Windows
kernel iSCSI packet timeout is being exceeded or something like that,
which actually seems pretty plausible given the information you've
provided.  I admit I'm not a Windows iSCSI expert.  If that is the case
then it should be solved by the mentioned LACP configuration and two
bonded ports on the DC box.

>> Keep us posted.
> 
> Will do, I'll have to price up the above options, and get approval for
> purchase, and then will take a few days to get it all in place/etc...

Given the temperature under the collar of the client, I'd simply spend
on adding the 2 bonded ports to the DC box, make all of the LACP
changes, and straighten out alignment/etc issues on the SSDs, md stripe
cache, etc.  This will make substantial gains.  Once the client sees the
positive results, then recommend the HBA for even better performance.
Remember, Intel's 520 SSD data shows nearly double the performance using
SATA3 vs SATA2.  Once you have alignment and md tuning squared away,
moving to the LSI should nearly double your block throughput.
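
For the record, the md stripe cache is just a sysfs knob on the array
device (array name below is an example -- use yours).  The default is
small (256), and the memory cost is roughly stripe_cache_size * 4KiB *
number_of_drives, so experiment rather than cranking it blindly:

  echo 4096 > /sys/block/md1/md/stripe_cache_size
  cat /sys/block/md1/md/stripe_cache_size     # confirm it took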

> Thank you very much for all the very useful assistance.

You're very welcome Adam.  Note my email domain. ;)  I love this stuff.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-09  4:09               ` Stan Hoeppner
@ 2013-02-10  4:40                 ` Adam Goryachev
  2013-02-10 13:22                   ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-10  4:40 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Stan Hoeppner <stan@hardwarefreak.com> wrote:
>On 2/8/2013 12:44 PM, Adam Goryachev wrote:
>> On 09/02/13 04:10, Stan Hoeppner wrote:
>>>> From the switch stats, ports 5 to 8 are the bonded ports on the
>storage
>>>> server (iSCSI traffic):
>>>>
>>>> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX 
>BroadcastTX
>>>> 5    734007958  0         110         120729310  0         0
>>>> 6    733085348  0         114         54059704   0         0
>>>> 7    734264296  0         113         45917956   0         0
>>>> 8    732964685  0         102         95655835   0         0
>>>
>>> I'm glad I asked you for this information.  This clearly shows that
>the
>>> server is performing LACP round robin fanning nearly perfectly.  It
>also
>>> shows that the bulk of the traffic coming from the W2K DC, which
>>> apparently hosts the Windows shares for TS users, is being pumped to
>the
>>> storage server over port 5, the first port in the switch's bonding
>>> group.  The switch is doing adaptive load balancing with
>transmission
>>> instead of round robin.  This is the default behavior of many
>switches
>>> and is fine.
>> 
>> Is there some method to fix this on the switch? I have configured the
>> switch that those 4 ports are a single LAG, which I assumed meant the
>> switch would be smart enough to load balance properly... Guess I
>never
>> checked that side of it though...
>
>After thinking this through more thoroughly, I realize your IO server
>may be doing broadcast aggregation and not round robin.  However, in
>either case this is bad, as it will cause out of order packets or
>duplicate packets.  Both of these are wrong for your network
>architecture and will cause problems.  RR will cause TCP packets to be
>reassembled out of sequence, causing extra overhead at the receiver,
>and
>possibly errors if not reassembled in correct order.  Broadcast will
>cause duplicate packets to arrive, at the receiver, which must discard
>them.  Both flood the receiver's switch port.

They were definitely RR before.

>The NIC ports on the IO server need to be configured as 802.3ad Dynamic
>if using the Linux bonding driver.  If you're using the Intel driver
>LACP it should be set to this as well, though the name may be
>different.
>
>Once you do you realize (for me again, as it's been a while) that any
>single session will be limited by default to a single physical link of
>the group.  LACP only gives increased bandwidth across links when
>multiple sessions are present.  This is done to preserve proper packet
>ordering per session, which is corrupted when fanning packets of a
>single session across all links. In the default Dynamic mode, you don't
>have the IO server flooding the DC with more packets than it can
>handle, because the two hosts will be communicating over the same
> link(s), no more, so bandwidth and packet volume is equal between them.
>
>So, you need to disable RR or broadcast, whichever it is currently, on
>the IO server, and switch it to Dynamic mode.  This will instantly kill
>the flooding problem, stop the switch from sending PAUSE frames to the
>IO server, and might eliminate the file/IO errors.  I'm not sure on
>this
>last one, as I've not seen enough information about the errors (or the
>actual errors themselves).

OK, so I changed the Linux iSCSI server to 802.3ad mode, and that killed all networking, so I changed the switch config to use LACP, and then that was working again.
I then tested single physical machine network performance (just a simple dd from the iSCSI device to /dev/null to read a few gig of data -- exact command below). I had some interesting results. Initially, each server individually could read around 120MB/s, so I tried 2 at the same time, and each got 120MB/s; three at a time gave the same result. Finally, testing 4 in parallel, two got 120MB/s and the other two got around 60MB/s. Eventually I worked out this:

Server    Switch port
1               6
2               5
3               7
4               7
5               7
6               7
7               7
8               6

So, for some reason, port 8 was never used, (unless I physically disconnected ports 5, 6 and 7). Also, a single port was shared for 5 machines, resulting in around 20MB/s for each (when testing all in parallel).
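
(The test itself was nothing fancier than this on each Xen host -- the by-path name is a placeholder for the actual iSCSI LUN, and the hostnames in the loop are made up:)

  # single box:
  dd if=/dev/disk/by-path/<iscsi-lun> of=/dev/null bs=1M count=4096
  # parallel runs, kicked off together from one shell:
  for h in xen1 xen4 xen5 xen6; do
      ssh $h "dd if=/dev/disk/by-path/<iscsi-lun> of=/dev/null bs=1M count=4096" &
  done; wait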

I eventually changed the iSCSI server to use xmit_hash_policy=1 (layer3+4) instead of the default layer2 hashing (the actual change is shown after the table below). This resulted in a minor improvement as follows:
Server    Switch port
1               6
2               5
3               8
4               6
5               6
6               6
7               6
8               7

So now, I still have 5 machines sharing a single port, but the other three get a full port each. I'm not sure why the balancing is so poor... The port number should be the same for all machines (iscsi), but the IP's are consecutive (x.x.x.31 - x.x.x.38).
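
(For reference, the change itself was just this on the storage server, from memory, with the bond assumed to be bond0 -- depending on kernel version it may need the bond taken down first, or the equivalent xmit_hash_policy=layer3+4 module/interfaces option and a restart:)

  echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy
  grep "Hash Policy" /proc/net/bonding/bond0    # should now report layer3+4 (1)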

Anyway, so I've configured the DC on machine 2, the three testing servers and two of the TS on the "shared port" machines, and the third TS and DB server onto the remaining machines.

Any suggestions on how to better balance the traffic would be appreciated!!!

>That said, disabling the Windows write
>caching on the local drives backed by the iSCSI LUNs might fix this as
>well.  It should never be left enabled in a configuration such as
>yours.

Have now done this across all the windows servers for all iSCSI drives, left it enabled for the RAM drive with the pagefile

>>>> So, traffic seems reasonably well balanced across all four links
>>> The storage server's transmit traffic is well balanced out of the
>>>NICs, but the receive traffic from the switch is imbalanced, almost
>>>3:1 between ports 5 and 7.  This is due to the switch doing ALB, and
>>>helps us diagnose the problem.
>> 
>> The switch doesn't seem to have any setting to configure ALB or RR,
>>or at least I don't know what I'm looking for.... In any case, I suppose
>>if both sides of the network have equivalent bandwidth, then it should
>>be OK....
>
>Let's see, I think you listed the switch model...  yes, GS716T-200
>
>It does stock 802.3ad static and dynamic link aggregation, dynamic by
>default it appears, so standard session based streams.  This is what
>you want.

I'm assuming that is what I have now, but I didn't do write tests so I can't be sure the switch will properly balance the traffic back to the server

>Ah, here you go.  It does have port based ingress/egress rate limiting.
> So you should be able to slow down the terminal server hosts so no
>single one can flood the DC.  Very nice.  I wouldn't have expected this
>in this class of switch.

I don't know if I want to do this, as it will also limit SMB, RDP, etc. traffic just as much.... I'll leave it for now, and perhaps come back to it if it is still an issue.

>So, you can fix the network performance problem without expending any
>money.  You'll just have one TS host and its users bogged down when
>someone does a big file copy.  And if you can find a Windows policy to
>limit IO per user, you can solve it completely.

I'll look into this later, but this is pretty much acceptable, the main issue is where one machine can impact other machines.

>That said, I'd still get two or 4 bonded ports into that DC share
>server to speed things up for everyone.

OK, I'll need to think about this one carefully. I wanted all 8 machines to be identical so that we can do live migration of the virtual machines, and also so that if physical hardware fails it is easy to reboot a VM on another physical host. If I add specialised hardware, then it requires the VM to run on that host (well, it would still work on another host with reduced performance, which is somewhat acceptable, but not preferable, since I might end up trying to fix a hardware failure and a performance issue at the same time, or hit other random issues related to the reduced performance).

>Add the 4 port to the DC if it'll work in the x16 slot, if not use two
>of the single port PCIe x1 NICs I mentioned and bond them in 802.3ad
>Dynamic mode, same as with the IO server.  Look into Windows TS per
>user IO rate limits.  If this capability exists, limit each user to 50MB/s.
>
>And with that, you should have fixed all the network issues.  Combined
>with the changes to the IO server, you should be all squared away.

OK, so apparently the motherboard on the physical machines will work fine with the dual or quad ethernet cards.

I'm not sure how this solves the problem though.

1) TS user asks the DC to copy file1 from the shareA to shareA in a different folder
2) TS user asks the DC to copy file1 from the shareA to shareB
3) TS user asks the DC to copy file1 from the shareA to local drive C:

In cases 1 and 2, I assume the DC will not actually send the file content over SMB, it will just do the copy locally, but the DC will read from the SAN at single ethernet speed and write to the SAN at single ethernet speed, since even if the DC uses RR to send the data at 2x1Gbps, the switch is LACP so will forward to the iSCSI server at 1Gbps. Hence, iSCSI is maxed out at 1Gbps... The iSCSI server can potentially satisfy other servers, if LACP is not making them share the same ethernet port. The DC can possibly maintain SMB/RDP traffic if LACP happens to choose the second port, but if LACP shares the same port, then the second ethernet is wasted.

Regardless of how many network ports are on the physical machines, the SAN will only send/receive at a max of 1G per machine, so the DC is still limited to 1G total iSCSI bandwidth. If I use RR on the DC, then it has 2G write and only 1G read performance, which seems strange.

The more I think about this, the worse it seems to get... It almost seems I should do this:
1) iSCSI uses RR and switch uses LAG (LACP)
2) All physical machines have a dual ethernet and use RR, and the switch uses LAG (LACP)
3) On the iSCSI server, I configure some sort of bandwidth shaping, so that the DC gets 2Gbps, and all other machines get 1Gbps
4) On the physical machines, I configure some sort of bandwidth shaping so that all VM's other than the DC get limited to 1Gbps

This seems like a horrible, disgusting hack, and I would really hate myself for trying to implement it, and I don't know that Linux will be good at limiting speeds this high, CPU overhead concerns included, etc.

I'm in a mess here, and not sure any of this makes sense...

How about:
1) Add dual port ethernet to each physical box
2) Use the dual port ethernet in RR to connect to the iSCSI
3) Use the onboard ethernet for the user network
4) Configure the iSCSI server in RR again

This means the TS and random desktops get a full 1Gbps for SMB access, the same as they had when it was a physical machine.
The DC gets a full 2Gbps access to the iSCSI server, the iSCSI server might send/flood the link, but I assume since there is only iSCSI traffic, we don't care.
The TS can also do 2Gbps to the iSCSI server, but again this is OK because the iSCSI has 4Gbps available
If a user copies a large file from the DC to local drive, it floods the 1G user LAN with SMB, which uses only 1Gbps on the iSCSI LAN for the DC, and 1Gbps for the TS on the iSCSI LAN (total 2Gbps on the iSCSI SAN).

To make this work, I need 8 x dual port cards, or in reality, 2 x 4port cards plus 4 x 2port cards (putting 4port cards into the san, and moving existing 2port cards), then I need a 48 port switch to connect everything up, and then I'm finished.

Add a SATA card to the SAN, and I'm laughing.... sure, it's a chunk of new hardware, but it just doesn't seem to work right any other way I think about it.

So, purchase list becomes:
2 x 4port ethernet card $450 each
4 x 2port ethernet card $161 each
1 x 48 port switch (any suggestions?) $600
2 x LSI HBA  $780
Total Cost: $2924

>Again, you have all the network hardware you need, so this is
>completely unnecessary.  You just need to get what you have
>configured correctly.

>Everything above should be even more helpful.  My apologies for not
>having precise LACP insight in my previous post.  It's been quite a
>while and I was rusty, and didn't have time to refresh my knowledge
>base before the previous post.

I don't see how LACP will make it better; well, it will stop sending pause commands, but other than that, it seems to limit the bandwidth to even less than 1Gbps. The question was asked if it would be worthwhile to just upgrade to a 10Gbps network for all machines.... I haven't looked at costing on that option, but I assume it is really just the same problem anyway: either speeds are unbalanced if the server has more bandwidth, or speeds are balanced if the server has equal bandwidth (limited balancing with LACP aside).

BTW, reading www.kernel.org/doc/Documentation/networking/bonding.txt, chapter 12.1.1, I think maybe balance-alb might be a better solution? It sounds like it would at least do a better job of avoiding 5 machines being on the same link ....

>> Just in case the budget dollars doesn't stretch that far, would it be
>> a reasonable budget option to do this:
>> Add 1 x 2port ethernet card to the DC machine
>> Add 7 x 1port ethernet card to the rest of the machines $32 (Intel
>> Pro 1000GT DT Adapter I 82541PI Low-Profile PCI)
>> Add 1 x 24port switch $300
>> 
>> Total Cost: $685
>
>If the DC can take a PCIe x4 dual port card, that should work fine with
>the reconfiguration I described above.  The rest of the gear in that
>$685 is wasted--no gain.  Use part of the remaining balance for the LSI
>9207-8i HBA.  That will make a big difference in throughput once you
>get alignment and other issues identified and corrected, more than double
>your current bandwidth and IOPS, making full time DRBD possible.

I will suggest the HBA anyway; might as well improve that now, and it also adds options for future expansion (up to 8 x SSDs).

I can't find that exact one, my supplier has suggested the LSI SAS 9211-8i pack for $390 or the LSI MegaRAID SAS 9240-8i pack for $429, is one of these equivalent/comparable?

>> I'm assuming this would stop sharing SMB/iSCSI on the same ports, and
>> improve the ability for the TS machines to at least talk to the DC
>and
>> know the IO is "in progress" and hence reduce the data loss/failures?
>
>Again this is all unnecessary once you implement the aforementioned
>changes.  If the IO errors on the TS machines still occur the cause
>isn't in the network setup.  Running CIFS(SMB)/iSCSI on the same port
>is
>done 24x7 by thousands of sites.  This isn't the cause of the TS IO
>errors.  Congestion alone shouldn't cause them either, unless a Windows
>kernel iSCSI packet timeout is being exceeded or something like that,
>which actually seems pretty plausible given the information you've
>provided.  I admit I'm no a Windows iSCSI expert.  If that is the case
>then it should be solved by the mentioned LACP configuration and two
>bonded ports on the DC box.

I suspect a part of all this was caused by the write caching on the windows drives, so hopefully that situation will improve now.

When doing the above dd tests, I noticed one machine would show 2.6GB/s for the second or subsequent reads (ie, cached) while all the other machines would show consistent read speeds equivalent to uncached speeds. If this one machine had to read large enough data (more than RAM) then it dropped back to normal expected uncached speeds. I worked out that this was the machine I had experimented with installing multipath-tools on, so I installed it on all the other machines, and hopefully it will allow improved performance through caching of the iSCSI devices.

I haven't done anything with the partitions as yet, but are you basically suggesting the following:
1) Make sure the primary and secondary storage servers are in sync and running
2) Remove one SSD from the RAID5, delete the partition, clear the superblock/etc
3) Add the same SSD back as /dev/sdx instead of /dev/sdx1
4) Wait for sync
5) Go to 2 with the next SSD etc

This would move everything towards the beginning of the disk by a small amount, but not change anything relative to DRBD/LVM/etc ....

Would I then need to do further tests to see if I need to do something more to move DRBD/LVM to the correct offset to ensure alignment? How would I test if that is needed?
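
(I guess I could at least look at where each layer starts, something like the following, unless there's a better way? Device/partition names here are placeholders:)

  fdisk -lu /dev/sdb                        # partition start sector
  cat /sys/block/sdb/sdb1/alignment_offset  # 0 means the kernel thinks the partition is aligned
  pvs -o +pe_start                          # where LVM starts allocating data within each PV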

>>> Keep us posted.
>> 
>> Will do, I'll have to price up the above options, and get approval
>for
>> purchase, and then will take a few days to get it all in place/etc...
>
>Given the temperature under the collar of the client, I'd simply spend
>on adding the 2 bonded ports to the DC box, make all of the LACP
>changes, and straighten out alignment/etc issues on the SSDs, md stripe
>cache, etc.  This will make substantial gains.  Once the client sees
>the
>positive results, then recommend the HBA for even better performance.
>Remember, Intel's 520 SSD data shows nearly double the performance
>using
>SATA3 vs SATA2.  Once you have alignment and md tuning squared away,
>moving to the LSI should nearly double your block throughput.

I'd prefer to do everything at once; then they will only pay once, and they should see a massive improvement in one jump. Smaller incremental improvements are harder for them to see..... Also, the HBA is not as expensive as I thought; I always assumed they were at least double or more in price....

Apologies if the above is 'confused', but I am :)

PS, was going to move one of the dual port cards from the secondary san to the DC machine, but haven't yet since I don't have enough switch ports, and now I'm really unsure whether what I have done will be an improvement anyway. Will find out tomorrow....

Summary of changes (more for my own reference in case I need to undo it tomorrow; quick verification sketch below):
1) disable disk cache on all windows machines
2) san1/2 convert from balance-rr to 802.3ad and add xmit_hash_policy=1
3) change switch LAG from Static to LACP
4) install multipath-tools on all physical machines (no config, just a reboot)
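
A quick way to check tomorrow that points 2 and 3 actually took (bond name assumed to be bond0):

  cat /proc/net/bonding/bond0
  # want: Bonding Mode: IEEE 802.3ad Dynamic link aggregation,
  #       Transmit Hash Policy: layer3+4 (1), and all four slaves up in one aggregator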

Thanks,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-10  4:40                 ` Adam Goryachev
@ 2013-02-10 13:22                   ` Stan Hoeppner
  2013-02-10 16:16                     ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-10 13:22 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/9/2013 10:40 PM, Adam Goryachev wrote:

> OK, so I changed the linux iSCSI server to 802.3ad mode, and that killed all networking, so I changed the switch config to use LACP, and then that was working again.

If not LACP, what mode were the switch ports in previously?

> I then tested single physical machine network performance (just a simple dd if=iscsi device of=/dev/null to read a few gig of data. I had some interesting results. Initially, each server individually could read around 120MB/s, so I tried 2 at the same time, and each got 120MB/s, so I tried three at a time, same result. Finally, testing 4 in parallel, two got 120MB/s and the other two got around 60MB/s. Eventually I worked out this:

When you say "machine" above, are you referring to physical machines, or
virtual machines?  Based on the 120/120/60/60 result with "4 machines",
I'm guessing you were only using 3 physical machines, testing from two
Windows guests on one of them.  If this is the case, the 60/60 is the
result of the two VMs sharing one physical GbE port.

> Server    Switch port
> 1               6
> 2               5
> 3               7
> 4               7
> 5               7
> 6               7
> 7               7
> 8               6

I don't follow this at all.

> So, for some reason, port 8 was never used, (unless I physically disconnected ports 5, 6 and 7). Also, a single port was shared for 5 machines, resulting in around 20MB/s for each (when testing all in parallel).

What exactly are you testing here?  To what end?

> I eventually changed the iSCSI server to use xmit_hash_policy to 1 (layer3+4) instead of layer2 hashing. This resulted in a minor improvement as follows:
> Server    Switch port
> 1               6
> 2               5
> 3               8
> 4               6
> 5               6
> 6               6
> 7               6
> 8               7
> 
> So now, I still have 5 machines sharing a single port, but the other three get a full port each. I'm not sure why the balancing is so poor... The port number should be the same for all machines (iscsi), but the IP's are consecutive (x.x.x.31 - x.x.x.38).

Ok, you've completely lost me.  5 hosts (machines) cannot share an
ethernet port.  So you must be referring to 5 VMs on a single host.  In
that case they share the ethernet bandwidth.  5 concurrent file
operations will result in ~20MB/s each.  The fact that you're getting
that from a Realtek 8111 is shocking.  Usually these chips suck with
this type of workload.

> Anyway, so I've configured the DC on machine 2, the three testing servers and two of the TS on the "shared port" machines, and the third TS and DB server onto the remaining machines.

> Any suggestions on how to better balance the traffic would be appreciated!!!

What type of traffic balancing are you asking for here?  Once you have
at least two bonded ports in the physical machine on which the DC VM
resides, and your 6 bonded links (IO server 4, DC 2) in LACP dynamic
mode, the switch will automatically balance session traffic on those
links.  I thought I explained this already.

>> That said, disabling the Windows write
>> caching on the local drives backed by the iSCSI LUNs might fix this as
>> well.  It should never be left enabled in a configuration such as
>> yours.
> 
> Have now done this across all the windows servers for all iSCSI drives, left it enabled for the RAM drive with the pagefile

That setting is supposed to enable/disable the cache *CHIP* on physical
drives.  A RAM drive doesn't have a cache chip.  Disable it just to keep
Windows from confusing itself.  Given that all of your Windows 'hosts'
are guest VMs, the command sent through the SCSI driver to disable the
drive cache is intercepted by Xen and discarded anyway.

I recommend disabling it so Windows doesn't confuse itself.  Windows is
infamous for doing all manner of undocumented things.  On the off chance
that having this setting enabled changes the behavior of something else
in Windows, which is expecting a drive cache to be present and enabled
when it in fact doesn't exist, you *need* to have it disabled for
safety.  Undocumented behavior is why I suspect having it enabled may
have contributed to those mysterious errors.  Give Windows enough rope
and it will hang itself.

Take away the rope.

> I'm assuming that is what I have now, but I didn't do write tests so I can't be sure the switch will properly balance the traffic back to the server

There is no "balancing" unless the load of two or more TCP sessions is
sufficiently high.  I tried to explain this previously.  When LACP
bonding is working properly, the only time you will see packet traffic
roughly evenly distributed across the DC host's bonded ports is when two
or more TS physical boxes have sustained file transfers going.  If that
switch can monitor port traffic in real time, you'll see the balancing
across the two ports.  You'll also see this on two ports in the IO
server's bond group.  If you simply look at the total metrics, those you
pasted here, 80-90% or more of the traffic to/from the DC box will be on
only one port.  Same with the IO server.  This is by design.  It is how
it is supposed to work.

>> Ah, here you go.  It does have port based ingress/egress rate limiting.
>> So you should be able to slow down the terminal server hosts so no
>> single one can flood the DC.  Very nice.  I wouldn't have expected this
>> in this class of switch.
> 
> I don't know if I want to do this, as it will also limit SMB, RDP. etc traffic just as much.... I'll leave it for now, and perhaps come back to it if it is still an issue.

Once you have at least two bonded ports in the DC box this shouldn't be
necessary.  If you put 4 bonded ports in, the issue is moot as then no
single box can flood any other single box, no matter which box we're
talking about --TS servers, DC, IO server-- no matter how many users are
doing what.  You could slap a DVD in every TS box on the network and
start a CIFS copy to any/all shares on the DC server.  Won't skip a
beat.  And if you configure a VLAN on that switch and enable QOS traffic
shaping, TS sessions wouldn't slow down, as you'd reserve priority for
RDP.  That's another thing that surprised me about this switch.  It's
got a ton of advanced features for its class.

>> So, you can fix the network performance problem without expending any
>> money.  You'll just have one TS host and its users bogged down when
>> someone does a big file copy.  And if you can find a Windows policy to
>> limit IO per user, you can solve it completely.
> 
> I'll look into this later, but this is pretty much acceptable, the main issue is where one machine can impact other machines.

Now that you know how to configure LACP properly on the bonded ports,
once you have a quad port NIC in the DC box this particular issue is
solved.  As I mentioned, with a dual port NIC this problem could still
occur if two users on two physical TS boxes both do a big file copy.  If
this was my project, I wouldn't do anything at this point but the quad
port card as it eliminates all doubt.  The extra $120 USD would
guarantee I didn't have this issue occur again.  But that's me.

>> That said, I'd still get two or 4 bonded ports into that DC share
>> server to speed things up for everyone.
> 
> OK, I'll need to think about this one carefully. I wanted all the 8 machines to be identical so that we can do live migration of the virtual machines, and also if physical hardware fails, then it is easy to reboot a VM on another physical host. If I add specialised hardware, then it requires the VM to run on that host, (well, would still work on another host with reduced performance, which is somewhat acceptable, but not preferable since might end up trying to fix a hardware failure and a performance issue at the same time, or other random issues related to the reduced performance.

I've been wondering since the beginning of this thread why you didn't
simply stick Samba on the IO server, format the LVM slice with XFS, and
serve CIFS shares directly.  You'd have had none of these problems, but
for the rr bonding mode.  File serving would simply scream.  The DC
could be a DC with a single NIC, same as the other boxen.  That's the
only way I'd have done this setup.  And the load of the DC VM is low
enough I'd have put it on one of the TS boxen and saved the cost of one box.
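
In rough terms that's all it would have taken -- LV and share names below
are made up, and you'd still need the usual Samba user/domain setup on top:

  mkfs.xfs /dev/vg0/fileshare
  mount /dev/vg0/fileshare /srv/share
  # smb.conf share stanza, roughly:
  #   [company]
  #       path = /srv/share
  #       read only = no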

> OK, so apparently the motherboard on the physical machines will work fine with the dual or quad ethernet cards.

Great.  This keeps your options open.

> I'm not sure how this solves the problem though.
> 
> 1) TS user asks the DC to copy file1 from the shareA to shareA in a different folder
> 2) TS user asks the DC to copy file1 from the shareA to shareB
> 3) TS user asks the DC to copy file1 from the shareA to local drive C:
> 
> In cases 1 and 2, I assume the DC will not actually send the file content over SMB, it will just do the copy locally, but the DC will read from the SAN at single ethernet speed and write to the san  at single ethernet speed,  since even if the DC uses RR to send the data at 2x1Gbps, the switch is LACP so will forward to the iSCSI server at 1Gbps. Hence, iSCSI is maxed out at 1Gbps... The iSCSI potentially can satisfy other servers if LACP is not making them share the same ethernet. The DC can possibly, if LACP happens to choose the second port, be able to maintain SMB/RDP traffic. but if LACP shares the same port, then the second ethernet is wasted.

And now you finally understand, I think, the limitations of bonding.
 To clearly spell them out, again:

1.  Ethernet bonding increases throughput for multi stream workloads
2.  Ethernet bonding does not increase the throughput of single stream
    workloads
3.  To increase throughput of single stream workloads a single faster
    link is required, in this case 10GbE.

Thankfully you have a multi-user workload, the perfect fit for bonding.
 You don't need 10Gb/s for a single user.  You need multiple 1Gb/s links
for the occasion that multiple users each need one GbE link worth of
throughput without starving others.

Have you ever owned or driven a turbocharged vehicle?  Cruising down the
highway the turbo is spinning at a low idle RPM.  When you need to pass
someone, you drop a gear and hammer the throttle.   The turbo spins up
from 20K RPM to 160K RPM in about 1/5th of a second, adding 50-100HP to
the engine's output.

This is in essence what bonding does for you.  It kicks in the turbo
when you need it, but leaves it at idle when you don't.  In this case
the turbo being extra physical links in the bond.

> Regardless of what number of network ports are on the physical machines, the SAN will only send/receive at a max of 1G per machine 

The IO server has 4 ports, so if you get the SSD array working as it
should, the IO server could move up to 8Gb/s aggregate, 4Gb/s each way.

> so the DC is still limited to 1G total iSCSI bandwidth. 

No.  With a bonded dual port NIC, it's 2Gb/s aggregate each way.  To
reach that requires at least two TCP session streams (or UDP).  This
could be two users on two TS servers each doing one file copy.  Or it
could be a combination of 100 streams from 100 users all doing large or
small CIFS transfers concurrently.  The more streams the better, if you
want to get both links into play.

You can test this easily yourself once you get a multiport NIC in the DC
box.  SSH into a Xen console on the DC box and launch iftop.  Then log
into two TS servers and start two large file copies from one DC share to
another.  This will saturate both Tx/Rx on both NIC ports.  Watch iftop.
 You should see pretty close to 4Gb/s throughput, 2Gb/s out and 2Gb/s in.
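
e.g., roughly (interface names are assumptions -- whatever the bond and
its slaves are called on that host):

  iftop -i bond0                 # live per-flow view on the bond
  grep eth /proc/net/dev         # raw per-slave byte counters, before/after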

> If I use RR on the DC, then it has 2G write and only 1G read performance, which seems strange.

Don't use RR.  Recall the problem RR on the IO server's 4 ports caused?
 Those 1.2 million pause frames being kicked back by the switch?  This
was due to the 4:1 b/w gap between the IO server NICs and the DC server
NIC.  If you configure balance-rr on the DC Xen host you'll get the same
problem talking to the TS boxen with single NICs.

> The more I think about this, the worse it seems to get... It almost seems I should do this:

Once you understand ethernet bonding a little better, how the different
modes work, the capabilities and limitations of each, you'll realize
things are getting better, not worse.

> 1) iSCSI uses RR and switch uses LAG (LACP)
> 2) All physical machines have a dual ethernet and use RR, and the switch uses LAG (LACP)
> 3) On the iSCSI server, I configure some sort of bandwidth shaping, so that the DC gets 2Gbps, and all other machines get 1Gbps
> 4) On the physical machines, I configure some sort of bandwidth shaping so that all VM's other than the DC get limited to 1Gbps
> 
> This seems like a horrible, disgusting hack, and I would really hate myself for trying to implement it, and I don't know that Linux will be good at limiting speeds this fast including CPU overhead concerns, etc
> 
> I'm in a mess here, and not sure any of this makes sense...

You're moving in the wrong direction, fast.  Must be lack of sleep or
something. ;)

> How about:
> 1) Add dual port ethernet to each physical box
> 2) Use the dual port ethernet in RR to connect to the iSCSI
> 3) Use the onboard ethernet for the user network
> 4) Configure the iSCSI server in RR again

/rolls eyes

You don't seem to be getting this...

> This means the TS and random desktop's get a full 1Gbps for SMB access, the same as they had when it was a physical machine
> The DC gets a full 2Gbps access to the iSCSI server, the iSCSI server might send/flood the link, but I assume since there is only iSCSI traffic, we don't care.
> The TS can also do 2Gbps to the iSCSI server, but again this is OK because the iSCSI has 4Gbps available
> If a user copies a large file from the DC to local drive, it floods the 1G user LAN with SMB, which uses only 1Gbps on the iSCSI LAN for the DC, and 1Gbps for the TS on the iSCSI LAN (total 2Gbps on the iSCSI SAN).
> 
> To make this work, I need 8 x dual port cards, or in reality, 2 x 4port cards plus 4 x 2port cards (putting 4port cards into the san, and moving existing 2port cards), then I need a 48 port switch to connect everything up, and then I'm finished.
> 
> Add SATA card to the SAN, and I'm laughing.... sure, it's a chunk of new hardware, but it just doesn't seem to work right any other way I think about it.

No, no, no, no, no.  No....

> So, purchase list becomes:
> 2 x 4port ethernet card $450 each
> 4 x 2port ethernet card $161 each
> 1 x 48 port switch (any suggestions?) $600
> 2 x LSI HBA  $780
> Total Cost: $2924
> 
>> Again, you have all the network hardware you need, so this is
>> completely unnecessary.  You just need to get what you have
>> configured correctly.
> 
>> Everything above should be even more helpful.  My apologies for not
>> having precise LACP insight in my previous post.  It's been quite a
>> while and I was rusty, and didn't have time to refresh my knowledge
>> base before the previous post.
> 
> I don't see how LACP will make it better, well, it will stop sending pause commands, but other than that, it seems to limit the bandwidth to even less than 1Gbps. The question was asked if it would be worthwhile to just upgrade to 10Gbps network for all machines.... I haven't looked at costing on that option, but I assume it is really just the same problem anyway, either speeds are unbalanced if server has more bandwidth, or speeds are balanced if server has equal bandwidth/limited balancing with LACP asiide)

Please re-read my previous long explanation email, and what I wrote
above.  This is so so simple...

Assuming you don't put Samba on the IO server which will fix all of this
with one silver bullet, the other silver bullet is to stick a quad port
NIC in the DC server, then configure it, the IO server, and the bonded
switch ports for LACP Dynamic mode, AND YOU'RE DONE with the networking
issues.

Then all you have left is straightening out the disk IO performance on the
IO server.

> BTW, reading at www.kernel.org/doc/Documentation/networking/bonding.txt in chapter 12.1.1 I think maybe balance-alb might be a better solution? It sounds like it would at least do a better job at avoiding 5 machines being on the same link .... 

"It sounds like it would at least do a better job at avoiding 5 machines
being on the same link .... "

The "5 machines" on one link are 5 VMs on a host with one NIC.  Bonding
doesn't exist on single NIC ports.  You've totally lost me here...

> I will suggest the HBA anyway, might as well improve that now anyway, and it also adds options for future expansion (up to 8 x SSD's). 

I usually suggest a real SAS/SATA HBA right away, but given what you
said about the client's state of mind, troubleshooting the current stuff
made more sense.

> I can't find that exact one, my supplier has suggested the LSI SAS 9211-8i pack for $390 or the LSI MegaRAID SAS 9240-8i pack for $429, is one of these equivalent/comparable?

9211-8i pack for $390  -- this should be the one with cables.  Confirm
first as you'll need to order 2 breakout cables if it doesn't come with
them.  LSI calls it "kit" instead of "pack".  This is one of the two
models I mentioned, good HBA.  The other was the 9207-8i which has
double the IOPS.  Your vendor doesn't offer it?  Wow...

9240-8i --  NO.  You don't want this.  Same chip as the 9211-8i but the
ports aim up, not forward, which always sucks.  The main difference is
the 9211-8i does hardware 0,1,1E,10, whereas the 9240 add hardware
RAID5/50.  As hardware RAID cards the performance of both sucks, only
suitable for a few spinning drives in a SOHO server.  In HBA mode
they're great for md/RAID and have good performance.  So why pay $40
more for shitty hardware RAID5/50 you won't use?

Neither is a great candidate for SSDs, but better than all competing
brands in this class.  The 9207-8i is the HBA you really want for SSDs.
 The chip on it is 3 generations newer than these two, and it has double
the IOPS.  It's a PCIe 3.0 card, LSI's newest HBA.  As per PCIe spec it
works in 2.0 and 1.0 slots as well.  I think your Intel server board is
2.0.  It's only $40 USD more over here.  If you get up to 8 of those
SSDs you'll really want to have this in the box instead of the 9211-8i
which won't be able to keep up.

> When doing the above dd tests, I noticed one machine would show 2.6GB/s for the second or subsequent reads (ie, cached) while all the other machines would show consistent read speeds equivalent to uncached speeds. If this one machine had to read large enough data (more than RAM) then it dropped back to normal expected uncached speeds. I worked out this machine I had experimented with installing multipath-tools, so I installed this on all other machines, and hopefully it will allow improved performance through caching of the iSCSI devices.

The boxes have a single NIC.  If multipath-tools increases performance it's
because of another undocumented feature (bug).  You can't multipath down
a single ethernet link.

> I haven't done anything with the partitions as yet, but are you basically suggesting the following:
> 1) Make sure the primary and secondary storage servers are in sync and running
> 2) Remove one SSD from the RAID5, delete the partition, clear the superblock/etc
> 3) Add the same SSD back as /dev/sdx instead of /dev/sdx1
> 4) Wait for sync
> 5) Go to 2 with the next SSD etc

No.  Simply execute 'fdisk -lu /dev/sdX' for each SSD and post the
output.  The critical part is to make sure the partitions start at the
first sector, and if they don't they should start at a sector number
divisible by either the physical sector size or the erase block size.
I'm not sure what the erase block size is for these Intel SSDs.
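
i.e. something like this (substitute your actual five SSD device names):

  for d in sdb sdc sdd sde sdf; do fdisk -lu /dev/$d | grep "^/dev/$d"; done
  # a start sector that's a multiple of 2048 (a 1MiB offset) is safe for 4KiB
  # pages and the common 128KiB-1MiB erase block sizes; the old fdisk default
  # of 63 is not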

> This would move everything to the beginning of the disk by a small amount, but not change anything relatively regarding DRBD/LVM/etc .... 

Oh, ok.  So you already know you created the partitions starting some
number of sectors after the start of the drive.  If they don't start at
a sector number described above, that would explain at least some of the
apparently low block IO performance.

> Would I then need to do further tests to see if I need to do something more to move DRBD/LVM to the correct offset to ensure alignment? How would I test if that is needed?

Might need to get Neil or Phil, or somebody else, involved here.  I'm not
sure if you'd want to do this on the fly with multiple md rebuilds, or
if you'd need to blow away the array and start over.  They sit atop md
and its stripe parameters won't change, so there's probably nothing
needed to be done with them.

>>>> Keep us posted.
>>>
>>> Will do, I'll have to price up the above options, and get approval
>> for
>>> purchase, and then will take a few days to get it all in place/etc...
>>
>> Given the temperature under the collar of the client, I'd simply spend
>> on adding the 2 bonded ports to the DC box, make all of the LACP
>> changes, and straighten out alignment/etc issues on the SSDs, md stripe
>> cache, etc.  This will make substantial gains.  Once the client sees
>> the
>> positive results, then recommend the HBA for even better performance.
>> Remember, Intel's 520 SSD data shows nearly double the performance
>> using
>> SATA3 vs SATA2.  Once you have alignment and md tuning squared away,
>> moving to the LSI should nearly double your block throughput.
> 
> I;d prefer to do everything at once, then they will only pay once, and they should see a massive improvement in one jump. Smaller incremental improvement is harder for them to see..... Also, the HBA is not so expensive, I always assumed they were at least double or more in price....

Agreed.  I must have misunderstood the level of, ahem, discontent of the
client.  WRT the HBAs, you were probably thinking of the full up LSI
RAID cards, which run ~$350-1400 USD.

> Apologies if the above is 'confused', but I am :)

Hopefully I helped clear things up a bit here.

> PS, was going to move one of the dual port cards from the secondary san to the DC machine, but haven't yet since I don't have enough switch ports, and now I'm really unsure whether what I have done will be an improvement anyway. Will find out tomorrow....

I wasn't aware you were low on Cu ports.

> Summary of changes (more for my own reference in case I need to undo it tomorrow):
> 1) disable disk cache on all windows machines
> 2) san1/2 convert from balance-rr to 802.3ad and add xmit_hash_policy=1
> 3) change switch LAG from Static to LACP
> 4) install multipath-tools on all physical machines (no config, just a reboot)

Hmm... #4  On machines with single NIC ports multipath will do nothing
good.  On machines with multiple physical interfaces that have been
bonded, you only have one path, so again, nothing good will arise.
Maybe you know something here I don't.

Hope things start falling into place for ya.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-10 13:22                   ` Stan Hoeppner
@ 2013-02-10 16:16                     ` Adam Goryachev
  2013-02-10 17:19                       ` Mikael Abrahamsson
  2013-02-12  2:46                       ` Stan Hoeppner
  0 siblings, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-10 16:16 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Stan Hoeppner <stan@hardwarefreak.com> wrote:

>On 2/9/2013 10:40 PM, Adam Goryachev wrote:
>
>> OK, so I changed the linux iSCSI server to 802.3ad mode, and that
>killed all networking, so I changed the switch config to use LACP, and
>then that was working again.
>
>If not LACP, what mode were the switch ports in previously?

I had them configured as a LAG, and that was in static mode. I just changed the static to LACP.
So I now have:
LAG1 ports 1,2,3,4 in LACP mode
LAG2 ports 5,6,7,8 in LACP mode

>> I then tested single physical machine network performance (just a
>simple dd if=iscsi device of=/dev/null to read a few gig of data. I had
>some interesting results. Initially, each server individually could
>read around 120MB/s, so I tried 2 at the same time, and each got
>120MB/s, so I tried three at a time, same result. Finally, testing 4 in
>parallel, two got 120MB/s and the other two got around 60MB/s.
>Eventually I worked out this:
>
>When you say "machine" above, are you referring to physical machines,
>or
>virtual machines?  Based on the 120/120/60/60 result with "4 machines",
>I'm guessing you were only using 3 physical machines, testing from two
>Windows guests on one of them.  If this is the case, the 60/60 is the
>result of the two VMs sharing one physical GbE port.

I'm referring to physical machines... This entire email is based on all VMs being shut down during testing. I have 8 physical boxes to run the VMs on, and 2 physical boxes for the storage servers. Only one storage server is operating at one time.

>> Server    Switch port
>> 1               6
>> 2               5
>> 3               7
>> 4               7
>> 5               7
>> 6               7
>> 7               7
>> 8               6
>
>I don't follow this at all.

The server number is one of the 8 physical machines; the switch port is the physical switch port that the SAN server used to send the data to that machine.

>> So, for some reason, port 8 was never used, (unless I physically
>disconnected ports 5, 6 and 7). Also, a single port was shared for 5
>machines, resulting in around 20MB/s for each (when testing all in
>parallel).
>
>What exactly are you testing here?  To what end?

Trying to ensure:
a) all physical boxes are working at full 1Gbps speeds
b) that the LACP is working to balance the traffic across the 4 links

>> I eventually changed the iSCSI server to use xmit_hash_policy to 1
>(layer3+4) instead of layer2 hashing. This resulted in a minor
>improvement as follows:
>> Server    Switch port
>> 1               6
>> 2               5
>> 3               8
>> 4               6
>> 5               6
>> 6               6
>> 7               6
>> 8               7
>> 
>> So now, I still have 5 machines sharing a single port, but the other
>three get a full port each. I'm not sure why the balancing is so
>poor... The port number should be the same for all machines (iscsi),
>but the IP's are consecutive (x.x.x.31 - x.x.x.38).
>
>Ok, you've completely lost me.  5 hosts (machines) cannot share an
>ethernet port.  So you must be referring to 5 VMs on a single host.  In
>that case they share the ethernet bandwidth.  5 concurrent file
>operations will result in ~20MB/s each.  The fact that you're getting
>that from a Realtek 8111 is shocking.  Usually these chips suck with
>this type of workload.

Nope, I'm saying that on 5 different physical boxes (the Xen hosts, specifically machines 1, 4, 5, 6, 7), if I do a dd if=/dev/disk/by-path/iscsivm1 of=/dev/null on all 5 concurrently, then they only get 20MB/s each. If I do one at a time, I get 130MB/s; if I do two at a time, I get 60MB/s, etc... If I do the same test on machines 1, 2, 3, 8 at the same time, each gets 130MB/s.

(Note, this doesn't test the SSD speed etc since all the machines are reading the same data at the same time, so it should be all cached at the iSCSI server side)

>> Anyway, so I've configured the DC on machine 2, the three testing
>servers and two of the TS on the "shared port" machines, and the third
>TS and DB server onto the remaining machines.
>
>> Any suggestions on how to better balance the traffic would be
>appreciated!!!
>
>What type of traffic balancing are you asking for here?  Once you have
>at least two bonded ports in the physical machine on which the DC VM
>resides, and your 6 bonded links (IO server 4, DC 2) in LACP dynamic
>mode, the switch will automatically balance session traffic on those
>links.  I thought I explained this already.

The problem is that (from my understanding) LACP will balance the traffic based on the source/destination MAC addresses, by default. So the bandwidth between any two machines is limited to a single 1Gbps link. So regardless of the number of ethernet ports on the DC box, it will only ever use a max of 1Gbps to talk to the iSCSI server.

However, if I configure Linux to use xmit_hash_policy=1 it will use the IP address and port (layer 3+4) to decide which trunk to use. It will still only use 1Gbps to talk to that IP:port combination.
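
(If I'm reading bonding.txt right, the layer3+4 slave choice for a TCP flow is roughly ((src port XOR dst port) XOR ((src IP XOR dst IP) AND 0xffff)) mod slave_count, so every packet of a given iSCSI session lands on the same one of the four links. e.g., with made-up addresses:)

  src_ip=$((0x0A000A1F)); dst_ip=$((0x0A000A14))   # placeholder IPs, not the real ones
  src_port=50432; dst_port=3260                    # 3260 = standard iSCSI target port
  echo $(( ((src_port ^ dst_port) ^ ((src_ip ^ dst_ip) & 0xffff)) % 4 ))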

>>> That said, disabling the Windows write
>>> caching on the local drives backed by the iSCSI LUNs might fix this
>as
>>> well.  It should never be left enabled in a configuration such as
>>> yours.
>> 
>> Have now done this across all the windows servers for all iSCSI
>drives, left it enabled for the RAM drive with the pagefile
>
>That setting is supposed to enables/disable the cache *CHIP* on
>physical drives.  A RAM drive doesn't have a cache chip.  Disable it just to
>keep Windows from confusing itself.  Given that all of your Windows 'hosts'
>are guest VMs, the command sent through the SCSI driver to disable the
>drive cache is intercepted by Xen and discarded anyway.
>
>I recommend disabling it so Windows doesn't confuse itself.  Windows is
>infamous for doing all manner of undocumented things.  On the off
>chance
>that having this setting enabled changes the behavior of something else
>in Windows, which is expecting a drive cache to be present and enabled
>when it in fact doesn't exist, you *need* to have it disabled for
>safety.  Undocumented behavior is why I suspect having it enabled may
>have contributed to those mysterious errors.  Give Windows enough rope
>and it will hang itself.
>
>Take away the rope.

OK, will do. Just to recap, Windows is limited to 4G RAM, so the Xen host is allocating 4G RAM to the Windows VM, and it is also passing in a virtual SCSI drive 4G in size. This virtual SCSI drive is a Linux 4G RAM drive. Windows has formatted it as a 4G drive and is using it for a 4G pagefile.

Anyway, I will disable it, I doubt it will make any difference, but as you said, best to remove the rope.

>> I'm assuming that is what I have now, but I didn't do write tests so
>I can't be sure the switch will properly balance the traffic back to
>the server
>
>There is no "balancing" unless the load of two or more TCP sessions is
>sufficiently high.  I tried to explain this previously.  When LACP
>bonding is working properly, the only time you will see packet traffic
>roughly evenly distributed across the DC host's bonded ports is when
>two
>or more TS physical boxes have sustained file transfers going.  If that
>switch can monitor port traffic in real time, you'll see the balancing
>across the two ports.  You'll also see this on two ports in the IO
>server's bond group.  If you simply look at the total metrics, those
>you
>pasted here, 80-90% or more of the traffic to/from the DC box will be
>on
>only one port.  Same with the IO server.  This is by design.  It is how
>it is supposed to work.

OK, so I think this is the problem, I haven't properly explained my environment...

There are 8 physical boxes used to run Xen
Each box has iscsi configured to connect to the iSCSI server
This produces one device on the xen host /dev/sdX for each of the LV's on the SAN
However, the SAN will see all iSCSI traffic as being from a single IP:port for each xen server (a total of 8 sessions)
Regardless of the number of TS users, or simultaneous copies to/from the DC, if the DC needs to read 5 different files, it will do so from its virtual SCSI drive that Xen has provided, Xen will pass those to Linux, which will pass the requests to the iSCSI software, which will send them to the SAN (from the same IP:port), and the SAN will reply over the same 1Gbps link for all 5 requests. Thus, the DC has a max of 1Gbps bandwidth to talk to the SAN.

>>> Ah, here you go.  It does have port based ingress/egress rate
>limiting.
>>> So you should be able to slow down the terminal server hosts so no
>>> single one can flood the DC.  Very nice.  I wouldn't have expected
>this
>>> in this class of switch.
>> 
>> I don't know if I want to do this, as it will also limit SMB, RDP.
>etc traffic just as much.... I'll leave it for now, and perhaps come
>back to it if it is still an issue.
>
>Once you have at least two bonded ports in the DC box this shouldn't be
>necessary.  If you put 4 bonded ports in, the issue is moot as then no
>single box can flood any other single box, no matter which box we're
>talking about --TS servers, DC, IO server-- no matter how many users
>are
>doing what.  You could slap a DVD in every TS box on the network and
>start a CIFS copy to any/all shares on the DC server.  Won't skip a
>beat.

OK, thinking about this extreme example....

Assume the DC box has a 4port ethernet, and the TS boxes are limited to the existing single port ethernet.

4 TS machines are each copying 8G of data to a share on the DC
Each TS tries to send data at 1Gbps to the DC using SMB
The switch will load balance based on MAC addresses, which may place more than one stream on the same physical port. In the best case the DC will receive the SMB data at 4Gbps; in the worst case all traffic lands on a single port and it receives at 1Gbps.
The DC will then write 4 streams of data to its SCSI disk (it doesn't know this is iSCSI)
The xen host of the DC VM will then write the 4 streams to the iSCSI disk (to the same destination IP:port and same MAC)
The switch will send all data to a single ethernet port of the SAN, maximum of 1Gbps

Thus, all TS boxes combined have a max write performance of 1Gbps to the SAN
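
(For reference, the Linux end of the bonding being discussed is set up roughly as below -- a sketch assuming a Debian-style /etc/network/interfaces with the ifenslave hooks; addresses and interface names are placeholders, not the real config:)

  auto bond0
  iface bond0 inet static
      address 192.168.1.10
      netmask 255.255.255.0
      bond-slaves eth1 eth2 eth3 eth4
      bond-mode 802.3ad
      bond-miimon 100
      # hash on IP:port rather than MAC only
      bond-xmit-hash-policy layer3+4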

> And if you configure a VLAN on that switch and enable QOS
>traffic shaping, TS sessions wouldn't slow down, as you'd reserve
> priority for RDP.  That's another thing that surprised me about this
> switch.  It's got a ton of advanced features for its class.

Sure, I could set up QoS to prioritise RDP traffic, then SMB, then iSCSI... or are you suggesting bandwidth reservation per protocol? That would just carve the 1Gbps link into smaller pieces. It does ensure each protocol gets its own share and none is starved, although using different networks (physical network cards/ports) would do this as well, without reducing anything to less than 1Gbps...
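
(If the shaping were ever done on the Linux hosts rather than in the switch, it would look something like the htb sketch below -- RDP given priority, SMB next, everything else last. The port numbers are the standard ones; the rates and device name are placeholders:)

  tc qdisc add dev eth0 root handle 1: htb default 30
  tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit
  tc class add dev eth0 parent 1:1 classid 1:10 htb rate 200mbit ceil 1gbit prio 0   # RDP
  tc class add dev eth0 parent 1:1 classid 1:20 htb rate 400mbit ceil 1gbit prio 1   # SMB
  tc class add dev eth0 parent 1:1 classid 1:30 htb rate 400mbit ceil 1gbit prio 2   # everything else
  tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 3389 0xffff flowid 1:10
  tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip dport 445 0xffff flowid 1:20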

>>> So, you can fix the network performance problem without expending
>any
>>> money.  You'll just have on TS host and its users bogged down when
>>> someone does a big file copy.  And if you can find a Windows policy
>to
>>> limit IO per user, you can solve it completely.
>> 
>> I'll look into this later, but this is pretty much acceptable, the
>main issue is where one machine can impact other machines.
>
>Now that you know how to configure LACP properly on the bonded ports,
>once you have a quad port NIC in the DC box this particular issue is
>solved.  As I mentioned, with a dual port NIC this problem could still
>occur if two users on two physical TS boxes both do a big file copy. 
>If
>this was my project, I wouldn't do anything at this point but the quad
>port card as it eliminates all doubt.  The extra $120 USD would
>guarantee I didn't have this issue occur again.  But that's me.

As mentioned, I don't see how this will be enough... The DC box with a quad port will only use one of the four ports to talk to the SAN, since the switch will only send data from that MAC address to that MAC address down a single port.

Unless I use RR on both the SAN and the DC, and both have 4 port cards. Then the SAN server will still flood the TS boxes, but that shouldn't matter, and the DC box can consume 100% of the bandwidth to the SAN, which will limit performance of the rest of the TS/DB servers.....

>>> That said, I'd still get two or 4 bonded ports into that DC share
>>> server to speed things up for everyone.
>> 
>> OK, I'll need to think about this one carefully. I wanted all the 8
>machines to be identical so that we can do live migration of the
>virtual machines, and also if physical hardware fails, then it is easy
>to reboot a VM on another physical host. If I add specialised hardware,
>then it requires the VM to run on that host, (well, would still work on
>another host with reduced performance, which is somewhat acceptable,
>but not preferable since might end up trying to fix a hardware failure
>and a performance issue at the same time, or other random issues
>related to the reduced performance.
>
>I've been wondering since the beginning of this thread why you didn't
>simply stick Samba on the IO server, format the LVM slice with XFS, and
>serve CIFS shares directly.  You'd have had none of these problems, but
>for the rr bonding mode.  File serving would simply scream.  The DC
>could be a DC with a single NIC, same as the other boxen.  That's the
>only way I'd have done this setup.  And the load of the DC VM is low
>enough I'd have put it on one of the TS boxen and saved the cost of one
>box.

To be honest, I wanted to move the DC and file server to a Linux VM, since at the time it was only an NT box, but I did need to upgrade to provide proper AD for one new machine, and I didn't want to upgrade to the new samba just released last year... Also, I couldn't split the data shares from the DC, since that would change the UNC path for the shares, and fixing everything that breaks would be a complicated job... This is an old environment with plenty of legacy apps; the reason it still ran NT was partly because nobody wanted to be responsible for breaking things. Anyway, it's upgraded to win2k now and running in a VM. It will get upgraded to win2k3 soon, but I'm stuck working on this performance issue first...

>> OK, so apparently the motherboard on the physical machines will work
>fine with the dual or quad ethernet cards.
>Great.  This keeps your options open.
>
>> I'm not sure how this solves the problem though.
>> 
>> 1) TS user asks the DC to copy file1 from the shareA to shareA in a
>different folder
>> 2) TS user asks the DC to copy file1 from the shareA to shareB
>> 3) TS user asks the DC to copy file1 from the shareA to local drive
>C:
>> 
>> In cases 1 and 2, I assume the DC will not actually send the file
>content over SMB, it will just do the copy locally, but the DC will
>read from the SAN at single ethernet speed and write to the san  at
>single ethernet speed,  since even if the DC uses RR to send the data
>at 2x1Gbps, the switch is LACP so will forward to the iSCSI server at
>1Gbps. Hence, iSCSI is maxed out at 1Gbps... The iSCSI potentially can
>satisfy other servers if LACP is not making them share the same
>ethernet. The DC can possibly, if LACP happens to choose the second
>port, be able to maintain SMB/RDP traffic. but if LACP shares the same
>port, then the second ethernet is wasted.
>
>And now you finally understand, I think, the limitations of
>bonding.
> To clearly spell them out, again:
>
>1.  Ethernet bonding increases throughput for multi stream workloads
>2.  Ethernet bonding does not increase the throughput of single stream
>    workloads
>3.  To increase throughput of single stream workloads a single faster
>    link is required, in this case 10GbE.

So, ignoring the SMB traffic, we are saying that iSCSI performance is workload number 2, and will not benefit from multi NIC's in each box...

>Thankfully you have a multi-user workload, the perfect fit for bonding.
>You don't need 10Gb/s for a single user.  You need multiple 1Gb/s links
>for the occasion that multiple users each need one GbE link worth of
>throughput without starving others.

I don't think so, the SMB traffic can be balanced, but the DC can still only read/write at a max of 1Gbps from the SAN....

>Have you ever owned or driven a turbocharged vehicle?  Cruising down
>the
>highway the turbo is spinning at a low idle RPM.  When you need to pass
>someone, you drop a gear and hammer the throttle.   The turbo spins up
>from 20K RPM to 160K RPM in about 1/5th of a second, adding 50-100HP to
>the engine's output.
>
>This is in essence what bonding does for you.  It kicks in the turbo
>when you need it, but leaves it at idle when you don't.  In this case
>the turbo being extra physical links in the bond.

No, but I would like to think I understand how it should work... in an ideal environment....

>> Regardless of what number of network ports are on the physical
>machines, the SAN will only send/receive at a max of 1G per machine 
>
>The IO server has 4 ports, so if you get the SSD array working as it
>should, the IO server could move up to 8Gb/s, 1Gb/s each way.
>
>> so the DC is still limited to 1G total iSCSI bandwidth. 
>
>No.  With a bonded dual port NIC, it's 2Gb/s aggregate each way.  To
>reach that requires at least two TCP session streams (or UDP).  This
>could be two users on two TS servers each doing one file copy.  Or it
>could be a combination of 100 streams from 100 users all doing large or
>small CIFS transfers concurrently.  The more streams the better, if you
>want to get both links into play.

Nope, since there is a max of 8 streams to the iSCSI server, and they are being balanced really badly with 5 out of 8 on the same physical port...

>You can test this easily yourself once you get a multiport NIC in the
>DC box.  SSH into a Xen console on the DC box and launch iftop.
>Then log into two TS servers and start two large file copies from one
> DC share to another.  This will saturate both Tx/Rx on both NIC ports.
>Watch iftop.
>You should see pretty close to 4Gb/s throughput, 2Gb/s out and 2Gb/s
>in.

Again, assuming a quad port NIC in the DC
The two TS boxes ask to read a file from SMB
The DC box asks to read two files from disk
The Xen box asks to read two files (just random block) from the iSCSI
The iSCSI replies with the data
The xen box passes up the layer
The DC box asks to write the data back to disk
The xen box passes the data to iSCSI
The iSCSI receives the data and writes to disk

The problem is that there is only a single stream for the "iSCSI replies with the data" and "iSCSI receives the data" steps above, so both are limited to 1Gbps (a total of 2Gbps full duplex) on both the DC and the iSCSI server, regardless of the number of ports each has.

>> If I use RR on the DC, then it has 2G write and only 1G read
>performance, which seems strange.
>
>Don't use RR.  Recall the problem RR on the IO server's 4 ports caused?
> Those 1.2 million pause frames being kicked back by the switch?  This
>was due to the 4:1 b/w gap between the IO server NICs and the DC server
>NIC.  If you configure balance-rr on the DC Xen host you'll get the
>same
>problem talking to the TS boxen with single NICs.

Only when the TS is reading from SMB will the DC flood it... or when the TS is reading from iSCSI will it get flooded also.

However, using different networks, where the DC has only 1Gbps for the SMB network and 4Gbps for iSCSI, will solve the first half of that problem and prevent the second half. In fact, if all xen hosts had 4 port ethernet, then there would be no flooding anywhere, except that each box could consume 100% of the SAN bandwidth, though I think TCP is pretty good at reducing the speed of the first connection until they are about equal...

>> The more I think about this, the worse it seems to get... It almost
>seems I should do this:
>
>Once you understand ethernet bonding a little better, how the different
>modes work, the capabilities and limitations of each, you'll realize
>things are getting better, not worse.
>
>> 1) iSCSI uses RR and switch uses LAG (LACP)
>> 2) All physical machines have a dual ethernet and use RR, and the
>switch uses LAG (LACP)
>> 3) On the iSCSI server, I configure some sort of bandwidth shaping,
>so that the DC gets 2Gbps, and all other machines get 1Gbps
>> 4) On the physical machines, I configure some sort of bandwidth
>shaping so that all VM's other than the DC get limited to 1Gbps
>> 
>> This seems like a horrible, disgusting hack, and I would really hate
>myself for trying to implement it, and I don't know that Linux will be
>good at limiting speeds this fast including CPU overhead concerns, etc
>> 
>> I'm in a mess here, and not sure any of this makes sense...
>
>You're moving in the wrong direction, fast.  Must be lack of sleep or
>something. ;)

I won't deny that... though I've just had about 6 hours sleep, and it's only 3am.. will go back to sleep after this email to ensure I'm ready for a busy day tomorrow.

>> How about:
>> 1) Add dual port ethernet to each physical box
>> 2) Use the dual port ethernet in RR to connect to the iSCSI
>> 3) Use the onboard ethernet for the user network
>> 4) Configure the iSCSI server in RR again
>
>/rolls eyes
>
>You don't seem to be getting this...
>
>> This means the TS and random desktop's get a full 1Gbps for SMB
>access, the same as they had when it was a physical machine
>> The DC gets a full 2Gbps access to the iSCSI server, the iSCSI server
>might send/flood the link, but I assume since there is only iSCSI
>traffic, we don't care.
>> The TS can also do 2Gbps to the iSCSI server, but again this is OK
>because the iSCSI has 4Gbps available
>> If a user copies a large file from the DC to local drive, it floods
>the 1G user LAN with SMB, which uses only 1Gbps on the iSCSI LAN for
>the DC, and 1Gbps for the TS on the iSCSI LAN (total 2Gbps on the iSCSI
>SAN).
>> 
>> To make this work, I need 8 x dual port cards, or in reality, 2 x
>4port cards plus 4 x 2port cards (putting 4port cards into the san, and
>moving existing 2port cards), then I need a 48 port switch to connect
>everything up, and then I'm finished.
>> 
>> Add SATA card to the SAN, and I'm laughing.... sure, it's a chunk of
>new hardware, but it just doesn't seem to work right any other way I
>think about it.
>
>No, no, no, no, no.  No....
>
>> So, purchase list becomes:
>> 2 x 4port ethernet card $450 each
>> 4 x 2port ethernet card $161 each
>> 1 x 48 port switch (any suggestions?) $600
>> 2 x LSI HBA  $780
>> Total Cost: $2924
>> 
>>> Again, you have all the network hardware you need, so this is
>>> completely unnecessary.  You just need to get what you have
>>> configured correctly.
>> 
>>> Everything above should be even more helpful.  My apologies for not
>>> having precise LACP insight in my previous post.  It's been quite a
>>> while and I was rusty, and didn't have time to refresh my knowledge
>>> base before the previous post.
>> 
>> I don't see how LACP will make it better, well, it will stop sending
>pause commands, but other than that, it seems to limit the bandwidth to
>even less than 1Gbps. The question was asked if it would be worthwhile
>to just upgrade to 10Gbps network for all machines.... I haven't looked
>at costing on that option, but I assume it is really just the same
>problem anyway, either speeds are unbalanced if server has more
>bandwidth, or speeds are balanced if server has equal bandwidth (limited
>balancing with LACP aside).
>
>Please re-read my previous long explanation email, and what I wrote
>above.  This is so so simple...
>
>Assuming you don't put Samba on the IO server which will fix all of
>this
>with one silver bullet, the other silver bullet is to stick a quad port
>NIC in the DC server, then configure it, the IO server, and the bonded
>switch ports for LACP Dynamic mode, AND YOU'RE DONE with the networking
>issues.

Potentially I could run xen on the storage server, but I really wanted to have clearly defined storage servers and VM servers... They run different Linux kernels/etc, storage server has less RAM, etc.... Though yes, I suppose that could work. Equally, I don't/can't use samba on the storage server due to the change in path for the data storage... This just seems like replacing the current challenging task with another...

>Then all you have left straightening out the disk IO performance on the
>IO server.
>
>> BTW, reading at
>www.kernel.org/doc/Documentation/networking/bonding.txt in chapter
>12.1.1 I think maybe balance-alb might be a better solution? It sounds
>like it would at least do a better job at avoiding 5 machines being on
>the same link .... 
>
>"It sounds like it would at least do a better job at avoiding 5
>machines
>being on the same link .... "
>
>The "5 machines" on one link are 5 VMs on a host with one NIC.  Bonding
>doesn't exist on single NIC ports.  You've totally lost me here...

Nope, they are 5 ethernet links, 5 physical boxes... sharing 4 ethernet links at the storage server side. Except all the data only uses one of the 4 ports.

>> I will suggest the HBA anyway, might as well improve that now anyway,
>and it also adds options for future expansion (up to 8 x SSD's). 
>
>I usually suggest a real SAS/SATA HBA right away, but given what you
>said about the client's state of mind, troubleshooting the current
>stuff
>made more sense.
>
>> I can't find that exact one, my supplier has suggested the LSI SAS
>9211-8i pack for $390 or the LSI MegaRAID SAS 9240-8i pack for $429, is
>one of these equivalent/comparable?
>
>9211-8i pack for $390  -- this should be the one with cables.  Confirm
>first as you'll need to order 2 breakout cables if it doesn't come with
>them.  LSI calls it "kit" instead of "pack".  This is one of the two
>models I mentioned, good HBA.  The other was the 9207-8i which has
>double the IOPS.  Your vendor doesn't offer it?  Wow...

>Neither is a great candidate for SSDs, but better than all competing
>brands in this class.  The 9207-8i is the HBA you really want for SSDs.
>The chip on it is 3 generations newer than these two, and it has double
>the IOPS.  It's a PCIe 3.0 card, LSI's newest HBA.  As per PCIe spec it
>works in 2.0 and 1.0 slots as well.  I think your Intel server board is
>2.0.  It's only $40 USD more over here.  If you get up to 8 of those
>SSDs you'll really want to have this in the box instead of the 9211-8i
>which won't be able to keep up.

I'll push again for the 9207-8i, I was asking the question of my supplier on a Saturday, which happened to also be Chinese New Year Eve.... so hopefully Monday will allow them to search the chain more easily.... 

>> When doing the above dd tests, I noticed one machine would show
>2.6GB/s for the second or subsequent reads (ie, cached) while all the
>other machines would show consistent read speeds equivalent to uncached
>speeds. If this one machine had to read large enough data (more than
>RAM) then it dropped back to normal expected uncached speeds. I worked
>out this machine I had experimented with installing multipath-tools, so
>I installed this on all other machines, and hopefully it will allow
>improved performance through caching of the iSCSI devices.
>
>The boxes have a single NIC.  If MS multipath increases performance
>it's because of another undocumented feature(bug).  You can't multipath
>down a single ethernet link.

No, this is linux multipath... the iSCSI is running at the Linux layer... all windows VM's think they are talking to normal physical SCSI drives.
Yes, linux multipath still only has a single path to the server, but the reason I originally investigated it is that it apparently provided better resilience by not timing out and failing requests. In the end I found the right parameter to tune for the standard iSCSI driver in Linux, and tuned that instead. Now I see that multipath also somehow adds caching at the Linux layer, so by installing it across all physical boxes, cached iSCSI reads should be a lot faster. Since all the TS boxes only have 100G drives, half of that is free space, and the physical boxes have about 10G free RAM, I can cache about 20% of the HDD. The DC reduces this to about 7% because it has a 300G data drive, about 50% full, but has more spare RAM (because it doesn't get the 4G RAM drive for the pagefile).

So, it should improve read performance (for cached reads anyway)
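
(The parameter isn't named above; for what it's worth, the usual open-iscsi knobs for this kind of resilience live in /etc/iscsi/iscsid.conf -- a sketch only, values illustrative:)

  # how long a dead session is retried before I/O is failed up the stack
  node.session.timeo.replacement_timeout = 30
  # NOP-Out pings used to detect a dead connection
  node.conn[0].timeo.noop_out_interval = 5
  node.conn[0].timeo.noop_out_timeout = 5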

>> I haven't done anything with the partitions as yet, but are you
>basically suggesting the following:
>> 1) Make sure the primary and secondary storage servers are in sync
>and running
>> 2) Remove one SSD from the RAID5, delete the partition, clear the
>superblock/etc
>> 3) Add the same SSD back as /dev/sdx instead of /dev/sdx1
>> 4) Wait for sync
>> 5) Go to 2 with the next SSD etc
>
>No.  Simply execute 'fdisk -lu /dev/sdX' for each SSD and post the
>output.  The critical part is to make sure the partitions start at the
>first sector, and if they don't they should start at a sector number
>divisible by either the physical sector size or the erase block size.
>I'm not sure what the erase block size is for these Intel SSDs.
 
Disk /dev/sdb: 480 GB, 480101368320 bytes
255 heads, 63 sectors/track, 58369 cylinders, total 937697985 sectors
Units = sectors of 1 * 512 = 512 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              63   931769999   465884968   fd  Lnx RAID auto

All drives are identically partitioned....
So, the start value should be 1 instead of 63? or should I just get rid of the partitions and use the raw disks as raid members?

The one thing partitioning added was to over-provision and leave a small amount of space at the end of each drive unallocated.... but I don't think that is as important given the comments about that on this list....
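
(Recreating one member partition aligned to 1MiB, while still leaving the tail of the drive unallocated for over-provisioning, would look roughly like the sketch below. The sizes and device name are placeholders, and it would only ever be run on a drive already failed out of the array:)

  parted -s /dev/sdb mklabel msdos
  # start at 1MiB (sector 2048), stop short of the end of the disk
  parted -s /dev/sdb mkpart primary 1MiB 460GiB
  parted -s /dev/sdb set 1 raid on
  parted -s /dev/sdb align-check optimal 1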

>> This would move everything to the beginning of the disk by a small
>amount, but not change anything relatively regarding DRBD/LVM/etc .... 
>
>Oh, ok.  So you already know you created the partitions starting some
>number of sectors after the start of the drive.  If they don't start
>at a sector number described above, that would explain at least some of
>the apparently low block IO performance.
>
>> Would I then need to do further tests to see if I need to do
>something more to move DRBD/LVM to the correct offset to ensure
>alignment? How would I test if that is needed?
>
>Might need to get Neil or Phil, somebody else, involved here.  I'm not
>sure if you'd want to do this on the fly with multiple md rebuilds, or
>if you'd need to blow away the array and start over.  They sit atop md
>and its stripe parameters won't change, so there's probably nothing
>needed to be done with them.

I don't mind multiple rebuilds, since even with a failure during a rebuild, I will have all data on the secondary storage server. Of course I would do this after hours though....
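
(The one-drive-at-a-time cycle is the usual mdadm dance, something like the sketch below, with device names assumed:)

  mdadm /dev/md1 --fail /dev/sdb1 --remove /dev/sdb1
  # repartition /dev/sdb with the corrected alignment, then:
  mdadm /dev/md1 --add /dev/sdb1
  # wait for the resync to finish before touching the next drive
  watch cat /proc/mdstat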

>>>>> Keep us posted.
>>>>
>>>> Will do, I'll have to price up the above options, and get approval
>>> for
>>>> purchase, and then will take a few days to get it all in
>place/etc...
>>>
>>> Given the temperature under the collar of the client, I'd simply
>spend
>>> on adding the 2 bonded ports to the DC box, make all of the LACP
>>> changes, and straighten out alignment/etc issues on the SSDs, md
>stripe
>>> cache, etc.  This will make substantial gains.  Once the client sees
>>> the
>>> positive results, then recommend the HBA for even better
>performance.
>>> Remember, Intel's 520 SSD data shows nearly double the performance
>>> using
>>> SATA3 vs SATA2.  Once you have alignment and md tuning squared away,
>>> moving to the LSI should nearly double your block throughput.
>> 
>> I;d prefer to do everything at once, then they will only pay once,
>and they should see a massive improvement in one jump. Smaller
>incremental improvement is harder for them to see..... Also, the HBA is
>not so expensive, I always assumed they were at least double or more in
>price....
>
>Agreed.  I must have misunderstood the level of, ahem, discontent of
>the
>client.  WRT to the HBAs, you were probably thinking of the full up LSI
>RAID cards, which run ~$350-1400 USD.
>
>> Apologies if the above is 'confused', but I am :)
>
>Hopefully I helped clear things up a bit here.
>
>> PS, was going to move one of the dual port cards from the secondary
>san to the DC machine, but haven't yet since I don't have enough switch
>ports, and now I'm really unsure whether what I have done will be an
>improvement anyway. Will find out tomorrow....
>
>I wasn't aware you were low on Cu ports.

Actually, after driving in (on a sunday) to do this, and not doing it, and now after some sleep, I realise I was wrong. I was MOVING 2 ports from the secondary/idle SAN to a machine. In fact, this would have freed one port.

ie:
remove two ports from san2
remove the single ethernet from DC box
add two ports to DC box

Ooops, amazing what some sleep can do for this...

>> Summary of changes (more for my own reference in case I need to undo
>it tomorrow):
>> 1) disable disk cache on all windows machines
>> 2) san1/2 convert from balance-rr to 802.3ad and add
>xmit_hash_policy=1
>> 3) change switch LAG from Static to LACP
>> 4) install multipath-tools on all physical machines (no config, just
>a reboot)
>
>Hmm... #4  On machines with single NIC ports multipath will do nothing
>good.  On machines with multiple physical interfaces that have been
>bonded, you only have one path, so again, nothing good will arise.
>Maybe you know something here I don't.

All I know is that it seemed to allow Linux to cache the iSCSI reads, which I assume will improve performance by reducing network traffic and load on the SAN...

>Hope things start falling into place for ya.

Well, back to sleep now, but I will find out in 4 hours more when they all get to work whether it is better, worse, or the same.... I'm hoping for a little better since we have:
1) removed the pause at the network layer
2) reduced the iSCSI traffic to a max of 1Gbps, which still floods the single 1Gbps port on the DC, but not as badly (ie, it can still flood out other SMB traffic, but not as badly, I think)
3) added iSCSI read caches at the xen hosts

The question remaining will be how this impacts on the TS boxes for access to their local C: data.

So, given the above, would you still suggest only adding a 4-port ethernet to the DC box configured with LACP, or should I really look at something else?

1) Adding dual or quad port cards to all xen boxes, separating the SAN traffic from the rest
2) Upgrading to 10G network cards, and maybe 2 x 10G on the SAN
3) Both options will include the LSI HBA anyway

Thanks,
Adam


Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-10 16:16                     ` Adam Goryachev
@ 2013-02-10 17:19                       ` Mikael Abrahamsson
  2013-02-10 21:57                         ` Adam Goryachev
  2013-02-12  2:46                       ` Stan Hoeppner
  1 sibling, 1 reply; 131+ messages in thread
From: Mikael Abrahamsson @ 2013-02-10 17:19 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Stan Hoeppner

On Mon, 11 Feb 2013, Adam Goryachev wrote:

> Nope, I'm saying that on 5 different (specifically machines 1, 4, 5, 6, 
> 7) physical boxes, (the xen host) if I do a dd 
> if=/dev/disk/by-path/iscsivm1 of=/dev/null on 5 machines concurrently, 
> then they only get 20Mbps each. If I do one at a time, I get 130Mbps, if 
> I do two at a time, I get 60Mbps, etc... If I do the same test on 
> machines 1, 2, 3, 8 at the same time, each gets 130Mbps

When you say Mbps, I read that as Megabit/s. Are you in fact referring to 
megabyte/s?

I suspect the load balancing (hashing) function on the switch terminating 
the LAG is causing your problem. Typically this hashing function doesn't 
look at load on individual links, but a specific src/dst/port hash points 
to a certain link, and there isn't really anything you can do about it. 
The only way around it is to go 10GE instead of the LAG, or move away from 
the LAG and assign 4 different IPs, one per physical link, and then make 
sure routing to/from server/client always goes onto the same link, cutting 
worst-case down to two servers sharing one link (8 servers, 4 links).

> The problem is that (from my understanding) LACP will balance the 
> traffic based on the destination MAC address, by default. So the 
> bandwidth between any two machines is limited to a single 1Gbps link. So 
> regardless of the number of ethernet ports on the DC box, it will only 
> ever use a max of 1Gbps to talk to the iSCSI server.

LACP is a way to set up a bunch of ports in a channel. It doesn't affect 
how traffic will be shared, that is a property of the hardware/software 
mix in the switch/operating (LACP is control plane, it's not forwarding 
plane). Device egressing the packet onto a link decides what port it goes 
out of, typically done on properties on L2, L3 and L4 (different for 
different devices).

> However, if I configure Linux to use xmit_hash_policy=1 it will use the 
> IP address and port (layer 3+4) to decide which trunk to use. It will 
> still only use 1Gbps to talk to that IP:port combination.

As expected. You do not want to send packets belonging to a single 
"session" out different ports, because then you might get packet 
reordering. This is called "per-packet load sharing"; if it's desirable 
then it might be possible to enable in the equipment. TCP doesn't like it 
though, don't know how storage protocols react.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-10 17:19                       ` Mikael Abrahamsson
@ 2013-02-10 21:57                         ` Adam Goryachev
  2013-02-11  3:41                           ` Adam Goryachev
  2013-02-11  4:33                           ` Mikael Abrahamsson
  0 siblings, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-10 21:57 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Stan Hoeppner, Dave Cundiff, linux-raid

On 11/02/13 04:19, Mikael Abrahamsson wrote:
> On Mon, 11 Feb 2013, Adam Goryachev wrote:
> 
>> Nope, I'm saying that on 5 different (specifically machines 1, 4, 5,
>> 6, 7) physical boxes, (the xen host) if I do a dd
>> if=/dev/disk/by-path/iscsivm1 of=/dev/null on 5 machines concurrently,
>> then they only get 20Mbps each. If I do one at a time, I get 130Mbps,
>> if I do two at a time, I get 60Mbps, etc... If I do the same test on
>> machines 1, 2, 3, 8 at the same time, each gets 130Mbps
> 
> When you say Mbps, I read that as Megabit/s. Are you in fact referring
> to megabyte/s?

Ooops, my mistake, yes, I meant MB/s for these results, because that is
what dd provides output as.

> I suspect the load balancing (hashing) function on the switch terminating
> the LAG is causing your problem. Typically this hashing function doesn't
> look at load on individual links, but a specific src/dst/port hash
> points to a certain link, and there isn't really anything you can do
> about it. The only way around it is to go 10GE instead of the LAG, or
> move away from the LAG and assign 4 different IPs, one per physical
> link, and then make sure routing to/from server/client always goes onto
> the same link, cutting worst-case down to two servers sharing one link
> (8 servers, 4 links).

Given the flat topology, I think it is difficult (not impossible) to
ensure that both inbound and outbound traffic will be sent/received on
the correct interface. Since the route TO any of the 8 destinations is
on the same network, linux would choose the lowest numbered interface
(AFAIK) for all outbound traffic. Getting the right outbound interface
is the first issue, once solved, ensuring that each interface will only
send an ARP reply for its own IP is the second issue. Both of these are
solvable...

However, this adds lots of complexity, and this system is supposed to
allow heartbeat to automatically move the 'floating' IP to the secondary
server on failure, which certainly adds some complications there also.
It'd be nice to avoid all that, but if that is what is needed, then I'll
have to address all that.

>> The problem is that (from my understanding) LACP will balance the
>> traffic based on the destination MAC address, by default. So the
>> bandwidth between any two machines is limited to a single 1Gbps link.
>> So regardless of the number of ethernet ports on the DC box, it will
>> only ever use a max of 1Gbps to talk to the iSCSI server.
> 
> LACP is a way to set up a bunch of ports in a channel. It doesn't affect
> how traffic will be shared, that is a property of the hardware/software
> mix in the switch/operating (LACP is control plane, it's not forwarding
> plane). Device egressing the packet onto a link decides what port it
> goes out of, typically done on properties on L2, L3 and L4 (different
> for different devices).
> 
>> However, if I configure Linux to use xmit_hash_policy=1 it will use
>> the IP address and port (layer 3+4) to decide which trunk to use. It
>> will still only use 1Gbps to talk to that IP:port combination.
> 
> As expected. You do not want to send packets belonging to a single
> "session" out different ports, because then you might get packet
> reordering. This is called "per-packet load sharing"; if it's desirable
> then it might be possible to enable in the equipment. TCP doesn't like
> it though, don't know how storage protocols react.

Hmmm, so from my reading, it seems that out of order packets will never
be received by the SAN, since the sender only has 1Gbps, and the switch
will only deliver the data over one port anyway.

However, the clients (8 physical machines) would certainly receive out
of order packets, since the SAN is sending over 4 x 1Gbps of data, and
the switch is delivering this too fast to the single 1Gbps port, and so
probably add some packet loss when queues fill up, and this would slow
everything down.

I see a kernel option net.ipv4.tcp_reordering, would setting this value
to a higher figure allow me to use RR for the bonded connections, even
if the server has more total bandwidth than the recipient?
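
(For reference, that knob is just a sysctl -- the default is 3 -- though whether raising it actually makes balance-rr usable here is exactly the open question:)

  sysctl -w net.ipv4.tcp_reordering=30
  echo 'net.ipv4.tcp_reordering = 30' >> /etc/sysctl.conf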

If I use a 10G connection for the SAN, and multiple 1G connections for
the clients, then I will still end up with a max of 1G read speed, since
the switch will only deliver data on a single port. So to get better
than 1G speed, I must use higher bandwidth channels, but using 10G on
all machines allows a single server to "flood" the network...

I suppose a maximum performance of 100MB/s for any individual client
could be acceptable, and if I could ensure that each client would
connect over a distinct port, I could drop in 2 x 4port ethernet devices
to the SAN, but I suspect this won't work because either the switch or
Linux will not properly balance the traffic. Potentially, I could
manually configure the MAC address on the clients, leave Linux to use
MAC based routing, such that the custom MAC address will calculate a
unique port for each. That just leaves the switch sending traffic back
to the SAN, and I don't know how I would go about that... Perhaps it
uses the source MAC address to decide the destination trunk, which will
either work because of the first fix above, or not work because of the
first fix above (if the calculations on Linux are different to the
switch)...

I'm still at a loss on how to correctly configure my network to solve
these issues, any hints would be appreciated.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-10 21:57                         ` Adam Goryachev
@ 2013-02-11  3:41                           ` Adam Goryachev
  2013-02-11  4:33                           ` Mikael Abrahamsson
  1 sibling, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-11  3:41 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Stan Hoeppner, Dave Cundiff, linux-raid

OK, I'm starting this all over....

At this point, I think regardless of what I do, the maximum bandwidth I
will get is 1Gbps per physical machine (VM server), since the switch
will only ever direct data over a single port (without going to 10Gbps
ethernet).

So, I think the best way to ensure there is always 1Gbps for each
physical machine (VM server) is this:
Get and install 2 x LSI HBA's for the iSCSI servers (1 each) to maximise
performance for the SSD's
Get a 48port switch to support all the additional ethernet ports
Install 8 ethernet ports into the iSCSI server
Install dual ethernet ports into each physical machine (really only need
single, but cost difference is minimal and availability is quicker)
Configure the switch so one port from each of the iSCSI servers plus
both ports on the physical box are on an individual VLAN (ie, 4 ports in
each VLAN)
The physical boxes are configured with ethernet bonding LACP
Configure unique IP addresses/ranges on each VLAN (only need small
subnets, enough for two IP's on each one for the physical machine and
one for the iSCSI server, but since this is totally private IP space it
doesn't matter much anyway).
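
(On the Linux side the per-VLAN addressing above is just one small subnet per switch VLAN -- a sketch with iproute2, where VLAN membership is port-based on the switch and all subnets/interface names are placeholders:)

  # iSCSI server: one untagged port per VLAN, one /29 per physical machine
  ip addr add 10.1.101.1/29 dev eth1    # VLAN for xen box 1
  ip addr add 10.1.102.1/29 dev eth2    # VLAN for xen box 2
  # xen box 1: its bonded pair sits in the matching subnet
  ip addr add 10.1.101.2/29 dev bond0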

Now, I lose the current reliability of ethernet connectivity at the
iSCSI server (loss of a single port means loss of a physical machine,
but that is acceptable since the VM can restart on another physical machine).
I get a minimum (and maximum) of 1Gbps of iSCSI performance for each
physical machine, a theoretical maximum of 16Gbps duplex. I don't think
my SSD's will stretch to that performance level (800MB/s read and write
concurrently), but even if it did, I doubt all servers would be asking
to do that anyway.
I get a full 1Gbps for user level data access (SMB/RDP/etc) which is
equivalent to what they had before when all the machines were physical
machines with local HDD's

Thus, I don't have any server sending/receiving data more quickly than
the other side can handle.
I don't have any one server that can steal all available IO from the others

The only downside here is added complexity to setup the 8 networks on
the iSCSI servers (minimal effort), configure the additional ethernet
and bonding on the 8 x physical machines (minimal), and configure the
failover from primary SAN to secondary (more complex, but this isn't
actually a primary concern right now anyway, and to actually run with
the secondary SAN would be a nightmare anyway since it only has 4 x
7200rpm HDD in RAID10, it will need an upgrade to SSD's before it is
really going to be useful).

Finally, the only additional thing I could attempt, would be to
configure the ports on the iSCSI server in pairs, so that a pair of
ports on the iSCSI and all ports from 2 physical machines (total of 8
ports) are on the same VLAN. This will work IF both linux and the switch
will properly balance LACP so that each physical server uses its own
port. The only thing this adds is resiliency from ethernet failure on
the iSCSI server. If the LACP doesn't properly balance traffic, then I'd
just scratch this and use the setup above.

Any comments/suggestions?

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-10 21:57                         ` Adam Goryachev
  2013-02-11  3:41                           ` Adam Goryachev
@ 2013-02-11  4:33                           ` Mikael Abrahamsson
  1 sibling, 0 replies; 131+ messages in thread
From: Mikael Abrahamsson @ 2013-02-11  4:33 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Stan Hoeppner, Dave Cundiff, linux-raid

On Mon, 11 Feb 2013, Adam Goryachev wrote:

> If I use a 10G connection for the SAN, and multiple 1G connections for 
> the clients, then I will still end up with a max of 1G read speed, since 
> the switch will only deliver data on a single port. So to get better 
> than 1G speed, I must use higher bandwidth channels, but using 10G on 
> all machines allows a single server to "flood" the network...

If your equipment supports it, you should put in some kind of policer to 
rate-limit traffic based on destination (or the whole port). Then you 
could limit each server to 2-3 gigabit/s on their 10G port, and the file 
server could get its entire 10G port (or limit that to 5-6 gigabit/s).

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-07  9:07 ` Dave Cundiff
  2013-02-07 10:19   ` Adam Goryachev
@ 2013-02-11 19:49   ` Roy Sigurd Karlsbakk
  2013-02-11 20:30     ` Dave Cundiff
  1 sibling, 1 reply; 131+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-11 19:49 UTC (permalink / raw)
  To: Dave Cundiff; +Cc: linux-raid, Adam Goryachev

> Why would you plug thousands of dollars of SSD into an onboard
> controller? It's probably running off a 1x PCIE shared with every
> other onboard device. An LSI 8x 8 port HBA will run you a few
> hundred(less than 1 SSD) and let you melt your northbridge. At least
> on my Supermicro X8DTL boards I had to add active cooling to it or it
> would overheat and crash at sustained IO. I can hit 2 - 2.5GB a second
> doing large sequential IO with Samsung 840 Pros on a RAID10.

Those onboard controllers are usually connected to 8x PCIe or similar. Also, those controllers from LSI won't allow TRIM support, which may come in handy…

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms with xenotypic etymology. In most cases adequate and relevant synonyms exist in Norwegian.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-11 19:49   ` Roy Sigurd Karlsbakk
@ 2013-02-11 20:30     ` Dave Cundiff
  0 siblings, 0 replies; 131+ messages in thread
From: Dave Cundiff @ 2013-02-11 20:30 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: linux-raid, Adam Goryachev

On Mon, Feb 11, 2013 at 2:49 PM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:
>> Why would you plug thousands of dollars of SSD into an onboard
>> controller? It's probably running off a 1x PCIE shared with every
>> other onboard device. An LSI 8x 8 port HBA will run you a few
>> hundred(less than 1 SSD) and let you melt your northbridge. At least
>> on my Supermicro X8DTL boards I had to add active cooling to it or it
>> would overheat and crash at sustained IO. I can hit 2 - 2.5GB a second
>> doing large sequential IO with Samsung 840 Pros on a RAID10.
>
> Those onboard controllers are usually connect to 8x PCIe or similar. Also, those controllers from LSI won't allow TRIM support, which may come in handy…
>

Be sure to check your motherboard documentation each time though. It
turned out his was connected to a 4x DMI 2.0 bus, which I had mistaken
for DMI 1.0 even after reading the docs. That's roughly equivalent to a
4x PCIe link, and it was still shared with all the other devices on the
motherboard, including another 4x slot and the gigabit ethernet
adapters. It wasn't a very good SATA controller either. You can get
onboard controllers that are decent; I have several builds that come
with onboard LSI SAS controllers, but those are hooked directly to the
northbridge on a dedicated 4x PCIe link.

The LSI RAID doesn't support TRIM; I don't know of any hardware
controller that does as yet. I just use them as plain HBAs with md for
the RAID. When I get a kernel with the md TRIM patches, TRIM will just
magically (hopefully) start working.


-- 
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-10 16:16                     ` Adam Goryachev
  2013-02-10 17:19                       ` Mikael Abrahamsson
@ 2013-02-12  2:46                       ` Stan Hoeppner
  2013-02-12  5:33                         ` Adam Goryachev
  2013-02-12  7:34                         ` Mikael Abrahamsson
  1 sibling, 2 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-12  2:46 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

If it's OK I'm going to snip a bunch of this and get to the meat of it,
so hopefully it's less confusing.

On 2/10/2013 10:16 AM, Adam Goryachev wrote:
...
...
> The problem is that (from my understanding) LACP will balance the traffic based on the destination MAC address, by default. So the bandwidth between any two machines is limited to a single 1Gbps link. So regardless of the number of ethernet ports on the DC box, it will only ever use a max of 1Gbps to talk to the iSCSI server.

> However, if I configure Linux to use xmit_hash_policy=1 it will use the IP address and port (layer 3+4) to decide which trunk to use. It will still only use 1Gbps to talk to that IP:port combination.

That is correct.  Long story short, the last time I messed with a
configuration such as this I was using a Cisco that fanned over 802.3ad
groups based on L3/4 info.  Stock 802.3ad won't do this.  I apologize
for the confusion, and for the delay in responding (twas a weekend after
all).  I just finished reading the relevant section of your GS716T-200
(GST716-v2) manual, and it does not appear to have this capability.

All is not lost.  I've done a considerable amount of analysis of all the
information you've provided.  In fact I've spent way too much time on
this.  But it's an intriguing problem involving interesting systems
assembled from channel parts, i.e. "DIY", and I couldn't put it down.  I
was hoping to come up with a long term solution that didn't require any
more hardware than a NIC and HBA, but that's just not really feasible.
So, my conclusions and recommendations, based on all the information I
have to date:

1.  Channel bonding via a single switch using standard link aggregation
    protocols cannot scale iSCSI throughput between two hosts.  The
    various Linux packet fanning modes don't work well here either for
    scaling both transmit and receive traffic.

2.  To scale iSCSI throughput using a single switch will require
    multiple host ports and MPIO, but no LAG for these ports.

3.  Given the facts above, an extra port could be added to each TS Xen
    box.  A separate subnet would be created for the iSCSI SAN traffic,
    and each port given an IP in the subnet.  Both ports would carry
    MPIO iSCSI packets, but only one port would carry user traffic.

4.  Given the fact that there will almost certainly be TS users on the
    target box when the DC VM gets migrated due to some kind of failure
    or maintenance, adding the load of file sharing may not prove
    desirable.  And you'd need another switch.  Thus, I'd recommend:

A.  Dedicate the DC Xen box as a file server and dedicate a non-TS
    Xen box as its failover partner.  Each machine will receive a quad
    port NIC.  Two ports on each host will be connected to the current
    16 port switch.  The two ports will be configured to balance-alb
    using the current user network IP address.  All switch ports will
    be reconfigured to standard mode, no LAGs, as they are not needed
    for Linux balance-alb.  Disconnect the 8111 mobo ports on these two
    boxes from the switch as they're no longer needed.  Prioritize RDP
    in the switch, leave all other protocols alone.

B.  We remove 4 links each from the iSCSI servers, the primary and the
    DRBD backup server, from the switch.  This frees up 8 ports for
    connecting the file servers' 4 ports, and connecting a motherboard
    ethernet port from each iSCSI server to the switch for management.
    If my math is correct this should leave two ports free.

C.  MPIO is designed specifically for IO scaling, and works well.
    So it's a better fit, and you save the cost of the additional
    switch(es) that would be required to do perfect balance-rr bonding
    between iSCSI hosts (which can be done easily with each host
    ethernet port connected to a different dedicated SAN switch).  In
    this case it would require 4 additional switches.  Instead what
    we'll do here is connect the remaining 2 ports from each Xen file
    server box, the primary and the backup, and all 4 ports on each
    iSCSI server, the primary and the backup, to a new 12-16 port
    switch.  It can be any cheap unmanaged GbE switch of 12 or more
    ports.  We'll assign an IP address in the new SAN subnet to each
    physical port on these 4 boxes and configure MPIO accordingly.

    So what we end up with is decent session based scaling of user CIFS
    traffic between the TS hosts and the DC Xen servers, with no single
    TS host bogging everyone down, and no desktop lag if both links are
    full due to two greedy users.  We end up with nearly perfect
    ~200MB/s iSCSI scaling in both directions between the DC Xen box
    (and/or backup) and the iSCSI servers, and we end up with nearly
    perfect ~400MB/s each way between the two iSCSI servers via DRBD,
    allowing you to easily do mirroring in real-time.

All for the cost of two quad port NICs and an inexpensive switch, and
possibly a new high performance SAS HBA.  I analyzed many possible paths
to a solution, and I think this one is probably close to ideal.
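
(For the MPIO piece above, the open-iscsi side is iface binding -- one iSCSI interface per physical port, with dm-multipath then merging the resulting paths. A sketch; the interface names, iface names and portal address are assumptions:)

  iscsiadm -m iface -I iface0 --op new
  iscsiadm -m iface -I iface0 --op update -n iface.net_ifacename -v eth2
  iscsiadm -m iface -I iface1 --op new
  iscsiadm -m iface -I iface1 --op update -n iface.net_ifacename -v eth3
  # discover through both ifaces, then log every path in
  iscsiadm -m discovery -t st -p 10.0.0.1:3260 -I iface0 -I iface1
  iscsiadm -m node --loginall=all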

You can pull off the same basic concept buying just the quad port HBA
for the current DC Xen box, removing 2 links between each iSCSI server
and the switch and direct connecting these 4 NIC ports via 2 cross over
cables, and using yet another IP subnet for these, with MPIO.  You'd
have no failover for the DC, and the bandwidth between the iSCSI servers
for BRBD would be cut in half.  But it only costs one quad port NIC.  A
dedicated 200MB/s is probably more than plenty for live DRBD, but again
you have no DC failover.

However, given that you've designed this system with "redundancy
everywhere" in mind, I'm guessing the additional redundancy justifies
the capital outlay for an unmanaged switch and a 2nd quad port NIC.

<BIG snip>

> So, given the above, would you still suggest only adding a 4port ethernet to the DC box configured with LACP, or should I really look at something else.

I think LACP is out, regardless of transmit hash mode.

If one of those test boxes could be permanently deployed as the failover
host for the DC VM, I think the dedicated iSCSI switch architecture
makes the most sense long term.  If the cost of the switch and another 4
port NIC isn't in the cards right now, you can go the other route with
just one new NIC.  And given that you'll be doing no ethernet channel
bonding on the iSCSI network, but IP based MPIO instead, it's a snap to
convert to the redundant architecture with new switch later.  All you'll
be doing is swapping cables to the new switch and changing IP address
bindings on the NICs as needed.

Again, apologies for the false start with the 802.3ad confusion on my
part.  I think you'll find all (or at least most) of the ducks in a row
in the recommendations above.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-12  2:46                       ` Stan Hoeppner
@ 2013-02-12  5:33                         ` Adam Goryachev
  2013-02-13  7:56                           ` Stan Hoeppner
  2013-02-12  7:34                         ` Mikael Abrahamsson
  1 sibling, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-12  5:33 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 12/02/13 13:46, Stan Hoeppner wrote:
> If it's OK I'm going to snip a bunch of this and get to the meat of it,
> so hopefully it's less confusing.

Thanks, was getting way over the top :)

> That is correct.  Long story short, the last time I messed with a
> configuration such as this I was using a Cisco that fanned over 802.3ad
> groups based on L3/4 info.  Stock 802.3ad won't do this.  

Yes, Cisco have their own proprietary extensions... EtherChannel I think
it is called.

> I apologize
> for the confusion, and for the delay in responding (twas a weekend after
> all).

No problem, I expected as much... Just because I'm silly enough to work
on a weekend, I realise most others don't. Besides, any help I get here
is a bonus :)

However, I did end up already making the solution proposal to the
client, and have already ordered some equipment, but see below...

>  I just finished reading the relevant section of your GS716T-200
> (GST716-v2) manual, and it does not appear to have this capability.

Nope.

> All is not lost.  I've done a considerable amount of analysis of all the
> information you've provided.  In fact I've spent way to much time on
> this.  But it's an intriguing problem involving interesting systems
> assembled from channel parts, i.e. "DIY", and I couldn't put it down.  I
> was hoping to come up with a long term solution that didn't require any
> more hardware than a NIC and HBA, but that's just not really feasible.

That's OK, I was fully prepared to get additional equipment, and the
customer was happy to throw money at it to get it fixed...

> So, my conclusions and recommendations, based on all the information I
> have to date:
> 
> 2.  To scale iSCSI throughput using a single switch will require
>     multiple host ports and MPIO, but no LAG for these ports.

I'm assuming MPIO is Multi Path IO (ie, MultiPath iSCSI)?

> 3.  Given the facts above, an extra port could be added to each TS Xen
>     box.  A separate subnet would be created for the iSCSI SAN traffic,
>     and each port given an IP in the subnet.  Both ports would carry
>     MPIO iSCSI packets, but only one port would carry user traffic.

This would allow iSCSI up to 2Gbit bi-directional traffic per xen box,
though some of it would also be consumed for the VM's. Also, the iSCSI
server would only be capable of a total 2Gbps on each network, so it
could handle two xen boxes demanding 100% throughput, which is a total
of 4Gbps which is pretty impressive (assuming SAN server uses
balance-alb). However, ignore this, I'll concentrate on what you suggest
below.

> 4.  Given the fact that there will almost certainly be TS users on the
>     target box when the DC VM gets migrated due to some kind of failure
>     or maintenance, adding the load of file sharing may not prove
>     desirable.  And you'd need another switch.  Thus, I'd recommend:
> 
> A.  Dedicate the DC Xen box as a file server and dedicate a non-TS
>     Xen box as its failover partner.  Each machine will receive a quad
>     port NIC.  Two ports on each host will be connected to the current
>     16 port switch.  The two ports will be configured to balance-alb
>     using the current user network IP address.  All switch ports will
>     be reconfigured to standard mode, no LAGs, as they are not needed
>     for Linux balance-alb.  Disconnect the 8111 mobo ports on these two
>     boxes from the switch as they're no longer needed.  Prioritize RDP
>     in the switch, leave all other protocols alone.

BTW, the switch has a maximum of 4 LAG's, so one option I was going to
try would not have worked anyway. Though that was probably just bad
design on my part... I think I'm past that now :)

> B.  We remove 4 links each from the iSCSI servers, the primary and the
>     DRBD backup server, from the switch.  This frees up 8 ports for
>     connecting the file servers' 4 ports, and connecting a motherboard
>     ethernet port from each iSCSI server to the switch for management.
>     If my math is correct this should leave two ports free.

I already have one motherboard port from SAN1/2 connected to another
switch, and also one motherboard port is a direct crossover cable
between san1 and san2 which is configured for DRBD traffic sync (so this
traffic is kept away from the iSCSI traffic).

However, after this, the only connection between the xen boxes running
the terminal servers and the iSCSI server is the single "management"
ethernet port. The Terminal Servers' C: drives are also on the iSCSI
server... so this doesn't quite work.

> C.  MPIO is designed specifically for IO scaling, and works well.
>     So it's a better fit, and you save the cost of the additional
>     switch(es) that would be required to do perfect balance-rr bonding
>     between iSCSI hosts (which can be done easily with each host
>     ethernet port connected to a different dedicated SAN switch).  In
>     this case it would require 4 additional switches.

I assume this means that if you have a quad port card in each machine,
with a single ethernet connected to each of 4 switches, then you can do
balance-rr because bandwidth on both endpoints is equal? That doesn't
quite work for me because I don't want the expense of a quad port card
in each machine, and also I don't want equal bandwidth.... I want the
server to have more bandwidth than the clients. In any case, let's
ignore this since it doesn't get us closer to the solution.

>     Instead what
>     we'll do here is connect the remaining 2 ports from each Xen file
>     server box, the primary and the backup, and all 4 ports on each
>     iSCSI server, the primary and the backup, to a new 12-16 port
>     switch.  It can be any cheap unmanaged GbE switch of 12 or more
>     ports.  We'll assign an IP address in the new SAN subnet to each
>     physical port on these 4 boxes and configure MPIO accordingly.

As mentioned, this cuts off the iSCSI from the rest of the 6 xen boxes.

>     So what we end up with is decent session based scaling of user CIFS
>     traffic between the TS hosts and the DC Xen servers, with no single
>     TS host bogging everyone down, and no desktop lag if both links are
>     full due to two greedy users.  We end up with nearly perfect
>     ~200MB/s iSCSI scaling in both directions between the DC Xen box
>     (and/or backup) and the iSCSI servers, and we end up with nearly
>     perfect ~400MB/s each way between the two iSCSI servers via DRBD,
>     allowing you to easily do mirroring in real-time.

I'm assuming MPIO requires the following:
The SAN must have multiple physical links over 'disconnected' networks
(i.e., different networks) on different subnets.
The iSCSI client must meet the same requirements.

> All for the cost of two quad port NICs and an inexpensive switch, and
> possibly a new high performance SAS HBA.  I analyzed many possible paths
> to a solution, and I think this one is probably close to ideal.

OK, what about this option:

Install dual port ethernet card into each of the 8 xen boxes
Install 2 x quad port ethernet card into each of the san boxes

Connect one port from each of the xen boxes plus 4 ports from each san
box to a single switch (16 ports)

Connect the second port from each of the xen boxes plus 4 ports from
each san box to a second switch (16 ports)

Connect the motherboard port (existing) from each of the xen boxes plus
one port from each of the SAN boxes (management port) to a single switch
(10 ports).

Total of 42 ports.

Leave the existing motherboard port configured with existing IP's/etc,
and dedicate this as the management/user network (RDP/SMB/etc).

We then configure the SAN boxes with two bond devices, each consisting
of a set of 4 x 1Gbps as balance-alb, with one IP address each (from 2
new subnets).

Add a "floating" IP to the current primary SAN on each of the bond
interfaces from the new subnets.

We configure each of the xen boxes with two new ethernets with one IP
address each (from the 2 new subnets).

Configure multipath to talk to the two floating IP's

See a rough sketch at:
http://suspended.wesolveit.com.au/graphs/diagram.JPG
I couldn't fit any detail like IP addresses without making it a complete
mess. BTW, I'm thinking sw1 and sw2 can be the same physical switch,
using VLANs to make them separate (although different physical switches
add to the reliability factor, so that is also something to think about).

Now, this provides up to 2Gbps traffic for any one host, and up to 8Gbps
traffic in total for the SAN server, which is equivalent to 4 clients at
full speed.

It also allows for the user network to operate at a full 1Gbps for
SMB/RDP/etc, and I could still prioritise RDP at the switch....

I'm thinking 200MB/s should be enough performance for any one machine's
disk access, and 1Gbps for any single user-side network access should be
ample given this is the same as what they had previously.

The only question left is what will happen when there is only one xen
box asking to read data from the SAN? Will the SAN attempt to send the
data at 8Gbps, flooding the 2Gbps that the client can handle, and
generating all the pause messages, or is this not relevant and it will
"just work"? Actually, I think from reading the docs, it will only use
one link out of each group of 4 to send the data, hence it won't attempt
to send at more than 2Gbps to each client....

I don't think this system will scale any further than this, I can only
add additional single Gbps ports to the xen hosts, and I can only add
one extra 4 x 1Gbps ports to each SAN server.... Best case is add 4 x
10Gbps to the SAN, 2 single 1Gbps ports to each xen, providing a full
32Gbps to the clients, each client gets max 4Gbps. In any case, I think
that would be one kick-ass network, besides being a pain to try and
debug, keep cabling neat and tidy, etc... Oh, and the current SSD's
wouldn't be that fast... At 400MB/s read, times 7 data disks is
2800MB/s, actually, damn, that's fast.

The only additional future upgrade I would plan is to upgrade the
secondary san to use SSD's matching the primary. Or add additional SSD's
to expand storage capacity and I guess speed. I may also need to add
additional ethernet ports to both SAN1 and SAN2 to increase the DRBD
cross connects, but these would I assume be configured using linux
bonding in balance-rr since there is no switch in between.

> You can pull off the same basic concept buying just the quad port HBA
> for the current DC Xen box, removing 2 links between each iSCSI server
> and the switch and direct connecting these 4 NIC ports via 2 cross over
> cables, and using yet another IP subnet for these, with MPIO.  You'd
> have no failover for the DC, and the bandwidth between the iSCSI servers
> for BRBD would be cut in half.  But it only costs one quad port NIC.  A
> dedicated 200MB/s is probably more than plenty for live DRBD, but again
> you have no DC failover.
> 
> However, given that you've designed this system with "redundancy
> everywhere" in mind, I'm guessing the additional redundancy justifies
> the capital outlay for an unmanaged switch and a 2nd quad port NIC.

Let's ignore this... we both agree it isn't a good solution.

> If one of those test boxes could be permanently deployed as the failover
> host for the DC VM, I think the dedicated iSCSI switch architecture
> makes the most sense long term.  If the cost of the switch and another 4
> port NIC isn't in the cards right now, you can go the other route with
> just one new NIC.  And given that you'll be doing no ethernet channel
> bonding on the iSCSI network, but IP based MPIO instead, it's a snap to
> convert to the redundant architecture with new switch later.  All you'll
> be doing is swapping cables to the new switch and changing IP address
> bindings on the NICs as needed.

I'd rather keep all boxes with identical hardware, so that any VM can be
run on any xen host.

So, here is the current purchase list, which the customer approved
yesterday; most of it should be delivered tomorrow (insufficient stock
means ordering from 4 different wholesalers):
4 x Quad port 1Gbps cards
4 x Dual port 1Gbps cards
2 x LSI HBA's (the suggested model)
1 x 48port 1Gbps switch (same as the current 16port, but more ports).

The idea is to pull out the 4 x dual port cards from san1/2 and install
the 4 x quad port cards. Then install a single dual port card on each
xen box. Install one LSI HBA in each san box. Use the 48 port switch to
connect it all together.

However, I'm going to be short 1 x quad port ethernet card and 1 x SATA
controller, so the secondary san is going to be even more lacking for up
to 2 weeks until these parts arrive. IMHO, that is not important at
this stage: if san1 falls over, I'm going to be screwed anyway running
on spinning disks :) though not as screwed as being plain
down/offline/nothing/just go home folks...

> Again, apologies for the false start with the 802.3ad confusion on my
> part.  I think you'll find all (or at least most) of the ducks in a row
> in the recommendations above.

No problem, this has been a definite learning experience for me and I
appreciate all the time and effort you've put into assisting.

BTW, I went last night (Monday night) and removed one dual port card
from san2 and installed it into the xen host running the DC VM. Configured
the two new ports on the xen box as active-backup (couldn't get LACP to
work since the switch only supports a max of 4 LAGs anyway). Removed one
port from the LAG on san1, and set up the three ports (1 x san + 2 x
xen1) as a VLAN with a private IP address on a new subnet. Today,
complaints have been almost non-existent, mostly relating to issues they
had yesterday but didn't bother to call about until today. It's now
4:30pm, so I'm thinking that the problem is solved just with that done.
I was going to
do this across all 8 boxes, using 2 x ethernet on each xen box plus one
x ethernet on each san, producing a max of 1Gbps ethernet for each xen
box. However, I think your suggestion of MPIO is much better, and
grouping the SAN ports into two bundles makes a lot more sense, and
produces 2Gbps per xen box.

Thanks again, I appreciate all the help.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-12  2:46                       ` Stan Hoeppner
  2013-02-12  5:33                         ` Adam Goryachev
@ 2013-02-12  7:34                         ` Mikael Abrahamsson
  1 sibling, 0 replies; 131+ messages in thread
From: Mikael Abrahamsson @ 2013-02-12  7:34 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: linux-raid

On Mon, 11 Feb 2013, Stan Hoeppner wrote:

> groups based on L3/4 info.  Stock 802.3ad won't do this.  I apologize

It's my understanding that 802.3ad (LACP) doesn't specify how packets
should be spread over available ports; that is up to the implementor to
decide, and it's always the packet egress device that decides what link to
put the packet on. 802.3ad LACP is control plane, not forwarding plane.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-12  5:33                         ` Adam Goryachev
@ 2013-02-13  7:56                           ` Stan Hoeppner
  2013-02-13 13:48                             ` Phil Turmel
  2013-02-13 16:17                             ` Adam Goryachev
  0 siblings, 2 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-13  7:56 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/11/2013 11:33 PM, Adam Goryachev wrote:

> I'm assuming MPIO is Multi Path IO (ie, MultiPath iSCSI)?

Yes.  Shorter to type. ;)

> I assume this means that if you have a quad port card in each machine,
> with a single ethernet connected to each of 4 switches, then you can do
> balance-rr because bandwidth on both endpoints is equal ? 

It's not simply that bandwidth is equal, but that the ports are
symmetrical.  Every source port on a host has only one path and one
destination port on another host.  It's identical to using crossover
cables between hosts, but the multiple independent switches allow for
more hosts to participate than would be possible with crossover cables
alone.

> As mentioned, this cuts off the iSCSI from the rest of the 6 xen boxes.

Palm, meet forehead.  I forgot you were using iSCSI for anything other
than live migrating the DC VM amongst the Xen hosts.

> I'm assuming MPIO requires the following:
> SAN must have multiple physical links over 'disconnected' networks (ie,
> different networks) on different subnets.
> iSCSI client must meet the same requirements.

I fubar'd this.  See below for a thorough explanation.  The IPs should
all be in the same subnet.

> OK, what about this option:
> 
> Install dual port ethernet card into each of the 8 xen boxes
> Install 2 x quad port ethernet card into each of the san boxes
> 
> Connect one port from each of the xen boxes plus 4 ports from each san
> box to a single switch (16ports)
> 
> Connect the second port from each of the xen boxes plus 4 ports from
> each san box to a second switch (16 ports)
> 
> Connect the motherboard port (existing) from each of the xen boxes plus
> one port from each of the SAN boxes (management port) to a single switch
> (10 ports).
> 
> Total of 42 ports.
> 
> Leave the existing motherboard port configured with existing IP's/etc,
> and dedicate this as the management/user network (RDP/SMB/etc).

Keeping the LAN and SAN traffic on different segments is a big plus.
But I still wonder if a single link for SMB traffic is enough for that
greedy bloke moving 50GB files over the network.

> We then configure the SAN boxes with two bond devices, each consisting
> of a set of 4 x 1Gbps as balance-alb, with one IP address each (from 2
> new subnets).

Use MPIO (multipath) only.  Do not use channel bonding.  MPIO runs
circles around channel bonding.  Read this carefully.  I'm pretty sure
you'll like this.

Ok, so this fibre channel guy has been brushing up a bit on iSCSI
multipath, and it looks like the IP subnetting is a non-issue, and after
thinking it through I've put palm to forehead.  As long as you have an
IP path between ethernet ports, the network driver uses the MAC address
from that point forward, DUH!  Remember that IP addresses exist solely
for routing packets from one network to another.  But within a network
the hardware address is used, i.e. the MAC address.  This has been true
for the 30+ years of networking.  Palm to forehead again. ;)

So, pick a unique subnet for SAN traffic, and assign an IP and
appropriate mask to each physical iSCSI port on all the machines.
 The rest of the iSCSI setup you already know how to do.  The only
advice I can give you here is to expose every server target LUN out
every physical port so the Xen box ports see the LUNs on every server
port.  I assume you've already done this with the current IP subnet, as
it's required to live migrate your VMs amongst the Xen servers.  So you
just need to change it over for the new SAN specific subnet/ports.  Now,
when you run 'multipath -ll' on each Xen box it'll see all the LUNs on
all 8 ports of each iSCSI server (as well as local disks), and
automatically do round robin fanning of SCSI block IO packets across all
8 server ports.  You may need to blacklist local devices.  You'll
obviously want to keep the LUNs on the standby iSCSI server masked, or
simply not used until needed.
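
On the target side nothing special is needed per port: as far as I
recall, ietd listens on every configured address unless you restrict it,
so a plain /etc/ietd.conf entry along these lines (target name and
backing device are made up) is already exposed out all of the SAN ports:

Target iqn.2013-02.au.com.example:san1.xen1-disk
        Lun 0 Path=/dev/vg0/xen1-disk,Type=blockio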

You only install the multipath driver on the initiators (Xen clients),
--NOT ON THE TARGETS (servers)-- .  All block IO transactions are
initiated by the client (think desktop PC with single SATA drive--who
talks first, mobo or drive?).  The iSCSI server will always reply on the
port a packet arrived on.  So, you get automatic perfect block IO
scaling on all server ports, all the time, no matter how many clients
are talking.  Told ya you'd like this. ;)  Here's a relatively
informative read on multipath:

http://linfrastructure.blogspot.com/2008/02/multipath-and-equallogic-iscsi.html
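
As a starting point, a minimal /etc/multipath.conf on each Xen initiator
might look something like this (the blacklist pattern is a placeholder --
adjust it to whatever the local OS disk is on each box):

defaults {
        user_friendly_names yes
        path_grouping_policy multibus
        path_selector "round-robin 0"
}
blacklist {
        devnode "^sda$"
}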

> Add a "floating" IP to the current primary SAN on each of the bond
> interfaces from the new subnets.

No, see above.

> We configure each of the xen boxes with two new ethernets with one IP
> address each (from the 2 new subnets).
> 
> Configure multipath to talk to the two floating IP's

See above.

> See a rough sketch at:
> http://suspended.wesolveit.com.au/graphs/diagram.JPG
> I couldn't fit any detail like IP addresses without making it a complete
> mess. BTW, sw1 and sw2 I'm thinking can be the same physical switch,
> using VLAN to make them separate (although different physical switches
> adds to the reliability factor, so that is also something to think about).
> 
> Now, this provides up to 2Gbps traffic for any one host, and up to 8Gbps
> traffic in total for the SAN server, which is equivalent to 4 clients at
> full speed.

With multipath, this architecture configured as above is going to be
pretty speedy.

> It also allows for the user network to operate at a full 1Gbps for
> SMB/RDP/etc, and I could still prioritise RDP at the switch....

Prioritizing RDP is a necessity for responsiveness.  But unloading the
SAN traffic from that single interface makes a huge difference, as
you've already seen.

> I'm thinking 200MB/s should be enough performance for any one machine
> disk access, and 1Gbps for any single user side network access should be
> ample given this is the same as what they had previously.

Coincidentally, the last 'decent' size network I managed had 525-ish
users, but our 4 Citrix servers were bare metal blades.  All our CIFS
traffic hit a single blade's GbE port.  That blade, ESX3, hosted our DC
file server VM and 6 other Linux and Windows VMs, some of which had
significant traffic.  User traffic HBA was single GbE, and the SAN HBA
was 2Gb/s fibre channel.  Same bandwidth as your soon-to-be setup.
Though the backend was different, one FasTt600 and one SataBlade, each
with a single 2Gb FC link.

> The only question left is what will happen when there is only one xen
> box asking to read data from the SAN? Will the SAN attempt to send the
> data at 8Gbps, flooding the 2Gbps that the client can handle, and

You're not using balance-rr inappropriately here.  So, no, this isn't an
issue.  In a request/reply chain, responses will only go out at the same
rate requests come in.  With two ports making requests to 8 ports, each
of the 8 will receive and reply to 1/4th of the total requests,
generating ~1/4th of the total bandwidth.  200/8=25, so each of the 8
ports will transmit ~25MB/s in replies.  Reply packets will be larger
due to the data payload, but the packet queuing, window scaling, etc in
the receiver TCP stack FOR EACH PORT will slow down the sender when
necessary.

And you're now wondering why TCP packet queuing didn't kick in with
balance-rr, causing all of those ethernet pause frames and other issues.
 Answer:  I think the problem was that the TCP back off features were
short circuited.  When you were using balance-rr, packets were likely
arriving wildly out of sequence, from the same session, but from
different MAC addresses.  You were sending from one IP stack out 4 MACs
to one IP stack on one MAC.

With multipath iSCSI, each MAC has its own IP address and own TCP stack,
so all packets always arrive in order or are easily reordered, allowing
packet queuing, window scaling, etc, to work properly.  Balance-rr works
in the cluster scenario up top because the packets still arrive in
sequence, even though on different ports from different MACs.  Say
1/2/3/4, 5/6/7/8, etc.  Previously you probably had packet ordering
something like 4/1/3/2/7/8/6/5 on occasion.  This short circuited the
receiving TCP stack preventing it from sending back offs.  The TCP stack
on the server thought all was fine and kept slinging packets until the
switch started sending back ethernet pause frames.  I'm not enough of an
ethernet or TCP expert to explain what happens next, but I'd bet those
Windows write errors are related to this.

Again, ya don't have to worry about any of this mess using multipath.
And, you get full port balanced bandwidth on all Xen hosts, and all
server ports, all the time.  Pretty slick.

> generate all the pause messages, or is this not relevant and it will
> "just work". Actually, I think from reading the docs, it will only use
> one link out of each group of 4 to send the data, hence it won't attempt
> to send at more than 2Gbps to each client....

See above.

> I don't think this system will scale any further than this, I can only
> add additional single Gbps ports to the xen hosts, and I can only add
> one extra 4 x 1Gbps ports to each SAN server.... Best case is add 4 x
> 10Gbps to the SAN, 2 single 1Gbps ports to each xen, providing a full
> 32Gbps to the clients, each client gets max 4Gbps. In any case, I think
> that would be one kick-ass network, besides being a pain to try and
> debug, keep cabling neat and tidy, etc... Oh, and the current SSD's
> wouldn't be that fast... At 400MB/s read, times 7 data disks is
> 2800GB/s, actually, damn, that's fast.

You could easily get by with a single quad port NIC on iSCSI target duty
in each server.  That's 800MB/s duplex, same as 4Gb fibre channel.
That's more than sufficient for your 8 Xen nodes, especially given the
bulk of your traffic is SMB, which is limited to ~100MB/s, which means
~100MB/s on the SAN, with ~300MB/s breathing room.  Use the other quad
port NIC direct connected with x-over cables to the other server for
DRBD using balance-rr.
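
For that DRBD bond, since both ends are Linux and there's no switch in
the middle, plain kernel bonding is all you need.  A rough Debian
ifupdown sketch (interface names and the replication subnet are
invented):

auto bond1
iface bond1 inet static
        address 10.99.99.1
        netmask 255.255.255.0
        bond-mode balance-rr
        bond-miimon 100
        bond-slaves eth8 eth9 eth10 eth11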

> The only additional future upgrade I would plan is to upgrade the
> secondary san to use SSD's matching the primary. Or add additional SSD's
> to expand storage capacity and I guess speed. I may also need to add
> additional ethernet ports to both SAN1 and SAN2 to increase the DRBD
> cross connects, but these would I assume be configured using linux
> bonding in balance-rr since there is no switch in between.

See above.

> I'd rather keep all boxes with identical hardware, so that any VM can be
> run on any xen host.

Looks like you've got the right architecture for it nailed down now.

> So, the current purchase list, which the customer approved yesterday,
> and most of it should be delivered tomorrow (insufficient stock, already
> ordering from 4 different wholesalers):
> 4 x Quad port 1Gbps cards
> 4 x Dual port 1Gbps cards
> 2 x LSI HBA's (the suggested model)
> 1 x 48port 1Gbps switch (same as the current 16port, but more ports).

And more than sufficient hardware.  I was under the impression that this
much capital was not available, or I'd have made different recommendations,
one being very similar to what you came up with here.

> The idea being to pull out 4 x dual port cards from san1/2 and install
> the 4 x quad port cards. Then install a single dual port card on each
> xen box. Install one LSI HBA in each san box. Use the 48 port switch to
> connect it all together.

> However, I'm going to be short 1 x quad ethernet, and 1 x sata
> controller, so the secondary san is going to be even more lacking for up
> to 2 weeks when these parts arrive, but IMHO, that is not important at
> this stage, if san1 falls over, I'm going to be screwed anyway running
> on spinning disks :) though not as screwed as being plain
> down/offline/nothing/just go home folks...

Two words:  Murphy's law

;)

> No problem, this has been a definite learning experience for me and I
> appreciate all the time and effort you've put into assisting.

There are millions of folks spewing vitriol at one another at any moment
on the net.  I prefer to be constructive, help people out when I can,
pass a little knowledge and creativity when possible, learn things
myself.  That's not to say I don't pop at folks now and then when
frustration boils. ;)  I'm human too.

> BTW, I went last night (monday night) and removed one dual port card
> from the san2, installed into the xen host running the DC VM. Configured
> the two new ports on the xen box as active-backup (couldn't get LACP to
> work since the switch only supports max of 4 LAG's anyway). Removed one
> port from the LAG on san1, and setup the three ports (1 x san + 2 x
> xen1) as a VLAN with private IP address on a new subnet.  Today,
> complaints have been non-existant, mostly relating to issues they had
> yesterday but didn't bother to call until today. It's now 4:30pm, so I'm
> thinking that the problem is solved just with that done.

So the biggest part of the problem was simply SMB and iSCSI on the same
link on the DC.  Let's see how the new system does with that user in
need of a clue stick, doing his 50GB SMB xfer when all users are humming
away.

> I was going to
> do this across all 8 boxes, using 2 x ethernet on each xen box plus one
> x ethernet on each san, producing a max of 1Gbps ethernet for each xen
> box. However, I think your suggestion of MPIO is much better, and
> grouping the SAN ports into two bundles makes a lot more sense, and
> produces 2Gbps per xen box.

Nope, no bundles.  As with balance-rr, MPIO is awesome when deployed
properly.

> Thanks again, I appreciate all the help.

I appreciate the fact you posted the topic.  I had to go (re)learn a
little bit myself.  I just hope you read this email before trying to
stick MPIO on top of a channel bond. ;)

Send me a picture of the racked gear when it's all done, front, and
back, so I can see how ugly that Medusa is, and remind myself of one of
many reasons I prefer fibre channel. ;)

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-13  7:56                           ` Stan Hoeppner
@ 2013-02-13 13:48                             ` Phil Turmel
  2013-02-13 16:17                             ` Adam Goryachev
  1 sibling, 0 replies; 131+ messages in thread
From: Phil Turmel @ 2013-02-13 13:48 UTC (permalink / raw)
  To: stan; +Cc: Adam Goryachev, Dave Cundiff, linux-raid

On 02/13/2013 02:56 AM, Stan Hoeppner wrote:
> On 2/11/2013 11:33 PM, Adam Goryachev wrote:

>> No problem, this has been a definite learning experience for me and I
>> appreciate all the time and effort you've put into assisting.
> 
> There are millions of folks spewing vitriol at one another at any moment
> on the net.  I prefer to be constructive, help people out when I can,
> pass a little knowledge and creativity when possible, learn things
> myself.  That's not to say I don't pop at folks now and then when
> frustration boils.   I'm human too.

> I appreciate the fact you posted the topic.  I had to go (re)learn a
> little bit myself.  I just hope read you this email before trying to
> stick MPIO on top of a channel bond. ;)

Allow me to echo these sentiments.  This has been a singularly positive
thread with a wealth of new information (for me).

> Send me a picture of the racked gear when it's all done, front, and
> back, so I can see how ugly that Medusa is, and remind myself of one of
> many reasons I prefer fibre channel. ;)

+1

Regards,

Phil

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-13  7:56                           ` Stan Hoeppner
  2013-02-13 13:48                             ` Phil Turmel
@ 2013-02-13 16:17                             ` Adam Goryachev
  2013-02-13 20:20                               ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-13 16:17 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Stan Hoeppner <stan@hardwarefreak.com> wrote:

>On 2/11/2013 11:33 PM, Adam Goryachev wrote:
>> I'm assuming MPIO requires the following:
>> SAN must have multiple physical links over 'disconnected' networks
>>(ie, different networks) on different subnets.
>> iSCSI client must meet the same requirements.
>
>I fubar'd this.  See below for a thorough explanation.  The IPs should
>all be in the same subnet.
>
>> OK, what about this option:
>> 
>> Install dual port ethernet card into each of the 8 xen boxes
>> Install 2 x quad port ethernet card into each of the san boxes
>> 
>> Connect one port from each of the xen boxes plus 4 ports from each
>> san box to a single switch (16ports)
>> 
>> Connect the second port from each of the xen boxes plus 4 ports from
>> each san box to a second switch (16 ports)
>> 
>> Connect the motherboard port (existing) from each of the xen boxes
>> plus one port from each of the SAN boxes (management port) to a single
>> switch (10 ports).
>> 
>> Total of 42 ports.
>> 
>> Leave the existing motherboard port configured with existing
>IP's/etc,
>> and dedicate this as the management/user network (RDP/SMB/etc).
>
>Keeping the LAN and SAN traffic on different segments is a big plus.
>But I still wonder if a single link for SMB traffic is enough for that
>greedy bloke moving 50GB files over the network.
>
>> We then configure the SAN boxes with two bond devices, each
>> consisting of a set of 4 x 1Gbps as balance-alb, with one IP
>> address each (from 2 new subnets).
>
>Use MPIO (multipath) only.  Do not use channel bonding.  MPIO runs
>circles around channel bonding.  Read this carefully.  I'm pretty sure
>you'll like this.
>
>Ok, so this fibre channel guy has been brushing up a bit on iSCSI
>multipath, and it looks like the IP subnetting is a non issue, and
>after
>thinking it through I've put palm to forehead.  As long as you have an
>IP path between ethernet ports, the network driver uses the MAC address
>from that point forward, DUH!.  Remember that IP addresses exist solely
>for routing packets from one network to another.  But within a network
>the hardware address is used, i.e. the MAC address.  This has been true
>for the 30+ years of networking.  Palm to forehead again. ;)
>
>So, pick a unique subnet for SAN traffic, and assign an IP and
>appropriate mask to each physical port in all the machines' iSCSI
>ports.

There are a couple of problems I'm having with this solution.

I've created 8 IP's (one on each eth interface on san1, all in the same subnet /24) and another 2 IP's on xen1 for it's 2 eth interfaces, again on the same subnet.

The initial reason I knew this wouldn't work is that Linux will see the arp request (broadcast) and respond from any interface. For example, from xen1 I ping each IP on san1, then do an arp -an and see the same MAC address for every IP (or sometimes I will see two or even three different MAC addresses, but mostly the same; I do arp -d in between to restart each test clean).

So, after some initial reading (since I remembered from years ago this was solvable) I did:
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

Now I get a unique MAC for each IP (perfect)

What about balancing the reverse traffic?
Assuming I set the same options on xen1, then san1 will reply to the different IPs on different ethernet ports of xen1.
Except, when xen1 sends any TCP request to any of san1's IPs, it will always come from the same IP, so san1 will always respond to the same IP. This was why I suggested using two distinct subnets/LANs/switches.

This means I'm limited to only 1Gbps outbound, and therefore only 1Gbps inbound, since san1 will always reply to the same IP.....

If I use two subnets, multipath will use the first ethernet interface/IP to talk to san1 on its first IP, and the second ethernet/IP to talk to san1's second subnet/IP.

I thought to use bonding with balance-alb for the 4 ports on san1 so that it will use a max of 1 out of 4 ports to talk to one client (avoid overloading the client) and also dynamically balance all clients for inbound/outbound with fancy arp announcing.

> The rest of the iSCSI setup you already know how to do.  The only
>advice I can give you here is to expose every server target LUN out
>every physical port so the Xen box ports see the LUNs on every server
>port.  I assume you've already done this with the current IP subnet, as
>it's required to live migrate your VMs amongst the Xen servers.  So you
>just need to change it over for the new SAN specific subnet/ports. 
>Now,
>when you run 'multipath -ll' on each Xen box it'll see all the LUNs on
>all 8 ports of each iSCSI server (as well as local disks), and
>automatically do round robin fanning of SCSI block IO packets across
>all
>8 server ports.  You may need to blacklist local devices.  You'll
>obviously want to keep the LUNs on the standby iSCSI server masked, or
>simply not used until needed.
>
>You only install the multipath driver on the initiators (Xen clients),
>--NOT ON THE TARGETS (servers)-- .  All block IO transactions are
>initiated by the client (think desktop PC with single SATA drive--who
>talks first, mobo or drive?).  The iSCSI server will always reply on
>the
>port a packet arrived on.  So, you get automatic perfect block IO
>scaling on all server ports, all the time, no matter how many clients
>are talking.  Told ya you'd like this. ;)  Here's a relatively
>informative read on multipath:
>
>http://linfrastructure.blogspot.com/2008/02/multipath-and-equallogic-iscsi.html

OK, so I need to use iscsiadm and bind it to the individual MAC/interface... now I see how this will work.
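
Something along these lines, I think (interface names and the portal IP
are just examples, with one iface per physical port on each xen box):

iscsiadm -m iface -I iface-eth2 --op=new
iscsiadm -m iface -I iface-eth2 --op=update -n iface.net_ifacename -v eth2
iscsiadm -m iface -I iface-eth3 --op=new
iscsiadm -m iface -I iface-eth3 --op=update -n iface.net_ifacename -v eth3
iscsiadm -m discovery -t sendtargets -p 192.168.101.1 -I iface-eth2 -I iface-eth3
iscsiadm -m node -L all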

>> Add a "floating" IP to the current primary SAN on each of the bond
>> interfaces from the new subnets.
>No, see above.

>> We configure each of the xen boxes with two new ethernets with one IP
>> address each (from the 2 new subnets).
>> 
>> Configure multipath to talk to the two floating IP's

So, to do the failover, I just need to stop ietd on san1 and start ietd on san2.... as long as I did a discovery at some point while san2 was running, so that the initiator knows that is a possible path to the devices...
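
Roughly this, I think (the resource and init script names are from
memory, so they need checking against the actual DRBD resource and the
iscsitarget packaging):

# on san1, if it is still alive:
/etc/init.d/iscsitarget stop
drbdadm secondary r0
# on san2:
drbdadm primary r0
/etc/init.d/iscsitarget start
# then on each xen box, log back in and let multipathd pick up the paths:
iscsiadm -m node -L all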

>> I'm thinking 200MB/s should be enough performance for any one machine
>> disk access, and 1Gbps for any single user side network access should
>be
>> ample given this is the same as what they had previously.
>
>Coincidentally, the last 'decent' size network I managed had 525'ish
>users, but our 4 Citrix servers were bare metal blades.  All our CIFS
>traffic hit a single blade's GbE port.  That blade, ESX3, hosted our DC
>file server VM and 6 other Linux and Windows VMs, some of which had
>significant traffic.  User traffic HBA was single GbE, and the SAN HBA
>was 2Gb/s fibre channel.  Same bandwidth as your soon-to-be setup.
>Though the backend was different, one FasTt600 and one SataBlade, each
>with a single 2Gb FC link.

I'll bet it was a lot neater too :)

>> The only question left is what will happen when there is only one xen
>> box asking to read data from the SAN? Will the SAN attempt to send
>the
>> data at 8Gbps, flooding the 2Gbps that the client can handle, and
>
>You're not using balance-rr inappropriately here.  So, no, this isn't
>an
>issue.  In a request/reply chain, responses will only go out at the
>same
>rate requests come in.  With two ports making requests to 8 ports, each
>of the 8 will receive and reply to 1/4th of the total requests,
>generating ~1/4th of the total bandwidth.  200/8=25, so each of the 8
>ports will transmit ~25MB/s in replies.  Reply packets will be larger
>due to the data payload, but the packet queuing, window scaling, etc in
>the receiver TCP stack FOR EACH PORT will slow down the sender when
>necessary.
>
>And you're now wondering why TCP packet queuing didn't kick in with
>balance-rr, causing all of those ethernet pause frames and other
>issues.
> Answer:  I think the problem was that the TCP back off features were
>short circuited.  When you were using balance-rr, packets were likely
>arriving wildly out of sequence, from the same session, but from
>different MAC addresses.  You were sending from one IP stack out 4 MACs
>to one 1 IP stack on one MAC.
>
>With multipath iSCSI, each MAC has its own IP address and own TCP
>stack,
>so all packets always arrive in order or are easily reordered, allowing
>packet queuing, window scaling, etc, to work properly.  Balance-rr
>works
>in the cluster scenario up top because the packets still arrive in
>sequence, even though on different ports from different MACs.  Say
>1/2/3/4, 5/6/7/8, etc.  Previously you probably had packet ordering
>something like 4/1/3/2/7/8/6/5 on occasion.  This short circuited the
>receiving TCP stack preventing it from sending back offs.  The TCP
>stack
>on the server thought all was fine and kept slinging packets until the
>switch started sending back ethernet pause frames.  I'm not enough of
>an
>ethernet or TCP expert to explain what happens next, but I'd bet those
>Windows write errors are related to this.
>
>Again, ya don't have to worry about any of this mess using multipath.
>And, you get full port balanced bandwidth on all Xen hosts, and all
>server ports, all the time.  Pretty slick.
>
>> generate all the pause messages, or is this not relevant and it will
>> "just work". Actually, I think from reading the docs, it will only
>use
>> one link out of each group of 4 to send the data, hence it won't
>attempt
>> to send at more than 2Gbps to each client....
>
>See above.

I'm not confident, but will give it a go and see.... since we will send one small request to each of the 8 san1 IP's, and each of those can reply at 1Gbps, and the reply will be bigger than the request (reads). Though I suppose we won't submit the next read request until after we get the first one, so perhaps this will keep things under control .... I'll let you know how it goes....

>> I don't think this system will scale any further than this, I can
>only
>> add additional single Gbps ports to the xen hosts, and I can only add
>> one extra 4 x 1Gbps ports to each SAN server.... Best case is add 4 x
>> 10Gbps to the SAN, 2 single 1Gbps ports to each xen, providing a full
>> 32Gbps to the clients, each client gets max 4Gbps. In any case, I
>think
>> that would be one kick-ass network, besides being a pain to try and
>> debug, keep cabling neat and tidy, etc... Oh, and the current SSD's
>> wouldn't be that fast... At 400MB/s read, times 7 data disks is
>> 2800GB/s, actually, damn, that's fast.
>
>You could easily get by with a single quad port NIC on iSCSI target
>duty
>in each server.  That's 800MB/s duplex, same as 4Gb fibre channel.
>That's more than sufficient for your 8 Xen nodes, especially given the
>bulk of your traffic is SMB, which is limited to ~100MB/s, which means
>~100MB/s on the SAN, with ~300MB/s breathing room.  Use the other quad
>port NIC direct connected with x-over cables to the other server for
>DRBD using balance-rr.

Yes, at some point I'm going to need to increase the connection for DRBD from the current 1Gbps, but one thing at a time :)

>> The only additional future upgrade I would plan is to upgrade the
>> secondary san to use SSD's matching the primary. Or add additional
>SSD's
>> to expand storage capacity and I guess speed. I may also need to add
>> additional ethernet ports to both SAN1 and SAN2 to increase the DRBD
>> cross connects, but these would I assume be configured using linux
>> bonding in balance-rr since there is no switch in between.
>
>See above.
>And more than sufficient hardware.  I was under the impression that
>this much capital was not available, or I'd had different
> recommendations.
>One being very similar to what you came up with here.

So did I, until they said "Just fix it, whatever you need....". That's when I had to make sure to purchase everything I might need in one go, and make sure it would work the first time.....

>Two words:  Murphy's law

Thanks. I thought I was in trouble after installing the equipment into san1: there was no keyboard. Eventually I pulled both quad port ethernets, but the keyboard was still really unreliable; pulled the new SATA controller, same thing.... Eventually tried a different keyboard, and it was perfect.... I can't believe a USB keyboard would fail right in the middle of a major upgrade. Thankfully there was a spare USB keyboard in its box available, or it might have taken me hours longer to sort it out!

>> No problem, this has been a definite learning experience for me and I
>> appreciate all the time and effort you've put into assisting.
>
>There are millions of folks spewing vitriol at one another at any
>moment
>on the net.  I prefer to be constructive, help people out when I can,
>pass a little knowledge and creativity when possible, learn things
>myself.  That's not to say I don't pop at folks now and then when
>frustration boils. ;)  I'm human too.

Absolutely agree with all that :)

>> BTW, I went last night (monday night) and removed one dual port card
>> from the san2, installed into the xen host running the DC VM.
>Configured
>> the two new ports on the xen box as active-backup (couldn't get LACP
>to
>> work since the switch only supports max of 4 LAG's anyway). Removed
>one
>> port from the LAG on san1, and setup the three ports (1 x san + 2 x
>> xen1) as a VLAN with private IP address on a new subnet.  Today,
>> complaints have been non-existant, mostly relating to issues they had
>> yesterday but didn't bother to call until today. It's now 4:30pm, so
>I'm
>> thinking that the problem is solved just with that done.
>
>So the biggest part of the problem was simply SMB and iSCSI on the same
>link on the DC.  Let's see how the new system does with that user in
>need of a clue stick, doing his 50GB SMB xfer when all users are
>humming away.

Well, I'm onsite now, in progress, got 3 hours to finish (by 7am) so better go and get it sorted!

>I appreciate the fact you posted the topic.  I had to go (re)learn a
>little bit myself.  I just hope read you this email before trying to
>stick MPIO on top of a channel bond. ;)

Only just in time, I had already installed some of the equipment when I got this.... 

>Send me a picture of the racked gear when it's all done, front, and
>back, so I can see how ugly that Medusa is, and remind myself of one of
>many reasons I prefer fibre channel. ;)

It is a real mess. I forgot to order extra cables, so I pulled random lengths of second-hand cables from the spares cupboard here... there are no more spares now, but I think they are all working.... Will send some pics, but I will need to come back another time with some new cables, cable ties, and some sort of cable labelling equipment to fix this up!

Thanks again, off to finish implementing now


Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-13 16:17                             ` Adam Goryachev
@ 2013-02-13 20:20                               ` Adam Goryachev
  2013-02-14 12:22                                 ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-13 20:20 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Well, it's 7am, and I'm still here.... It didn't all go as well as I had planned....

I initially could ping perfectly from either of the two IPs on the xen box to any of the 8 IPs on san1; even ping -f worked perfectly. Whatever I did, I couldn't get an iscsiadm .... discover to work... I could see the packets being sent from the san box (tcpdump) but they were never received by the xen box.

Eventually I disabled all except one ethernet device on both machines; still no luck. Finally, out of desperation I pulled the cables from both machines, dropped in a direct cable (i.e., bypassing the nice shiny new switch), and discovery worked immediately. So I tried with the old switch, but same problem, so I've now connected each xen box directly to a san1 ethernet port, so they now all get a dedicated 1Gbps port each.

I think the problem with the switch is that I didn't configure it properly to support the 9000 MTU, or something like that, which would explain why lots of small packets are fine (so not faulty cables, network cards, switches, etc) but big packets fail (like the response to a DiscoveryAll packet).
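
Next visit I'll test the jumbo frame path properly before blaming anything else; something like this from a xen box to one of san1's IPs should show it (the interface name and IP are just examples):

ip link show dev eth2                  # check the interface really is at mtu 9000
ping -M do -s 8972 192.168.101.1       # 9000 minus 28 bytes of IP/ICMP headers, fragmentation prohibited
ping -M do -s 1472 192.168.101.1       # sanity check at the standard 1500 MTU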

Anyway, all systems are online, and I think I will leave things as is for now.

What I have accomplished:
1) All systems should be using dedicated 1Gbps for iSCSI and 1Gbps for everything else
2) All hardware is physically installed

What I think I need next time
1) 10 x colour coded 2m cables (management/user LAN ports), probably blue to match all the rest of the user cabling
2) 8 cables in green (port 1 xen)
3) 8 cables in yellow (port 2 xen)
4) 8 cables in white (4 each for san1/san2 on 1st card)
5) 8 cables in grey (4 each for san1/san2 on 2nd card)
6) Lotsa cable ties to keep each bundle together

I don't really know what colour cables are available, nor am I even sure if it is such a good idea to use so many different colours.... Another option would be to stick with two colours: one for the iSCSI SAN network, and the second for the user LAN. That just makes it hard trying to work out which port/machine the other end of a random cable is connected to.....

Anyway, monitoring systems say everything is OK, testing says it's working, so I'm off home. No pictures yet; it's so messy it's embarrassing, and it isn't even working properly. Hopefully when I'm finished it will be worth a picture or two :)

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-13 20:20                               ` Adam Goryachev
@ 2013-02-14 12:22                                 ` Stan Hoeppner
  2013-02-15 13:31                                   ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-14 12:22 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/13/2013 2:20 PM, Adam Goryachev wrote:
> Well, it's 7am, and I'm still here.... It all didn't go as well as I had
> planned....

Never does...

> I initially could ping perfectly from either of the two IP's on the xen
> box to any of the 8 IP's on the san1, even ping -f worked perfectly.
> Whatever I did, I couldn't get a iscsiadm .... discover to work... I
> could see the packets being sent from the san box (tcpdump) but never
> received by the xen box.

I'm pretty sure I know what most, if not all, of the problem is here.
For this iSCSI/multipath setup to work with all the ethernet ports
(clients and server) on a single subnet, you have to configure source
routing.  Otherwise the Linux kernel is going to use a single interface
for all outbound IP packets destined for the subnet.  So, you have two
options:

1.  Keep a single subnet and configure source routing
2.  Switch to using 8 unique subnets, one per server port

With more than two iSCSI target IPs/ports on the server, using unique
subnets on each port will be a PITA to configure on the Xen client
machines, as you'll have to bind 8 different addresses to each ethernet
port.  And keeping track of how you've set up 8 different subnets will be
a PITA.  So assuming you already have all the interfaces on a single
subnet, source routing is probably much easier.  I believe this is how we
do it.

I don't know your port or IP info so I'm using fictitious values in this
example how-to: subnet 192.168.101.0/24, with 192.168.101.1 through .8
bound to eth0 through eth7 on san1.

Let's start with the iSCSI target server, san1.  First, you probably
need to revert the arp changes you made back to their original values.
The changes you made earlier, according to your email, were:

echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
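
Assuming those were at the kernel defaults beforehand (0 for both),
reverting is just:

echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce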


Next enable arp_filter on all 8 SAN ports:

~$ echo 1 > /proc/sys/net/ipv4/conf/eth0/arp_filter
......
~$ echo 1 > /proc/sys/net/ipv4/conf/eth7/arp_filter


Then create 8 table entries with names, such as port_0 thru port_7:

~$ echo 100 port_0 >> /etc/iproute2/rt_tables
......
~$ echo 107 port_7 >> /etc/iproute2/rt_tables


Next add the route table for your 8 interfaces.

~$ ip route add 192.168.101.0/24 dev eth0 src 192.168.101.1 table port_0
......
~$ ip route add 192.168.101.0/24 dev eth7 src 192.168.101.8 table port_7


Now create the source policy rules:

~$ ip rule add from 192.168.101.1 table port_0
......
~$ ip rule add from 192.168.101.8 table port_7


Now we flush the routing table cache to make the new policy active:

~$ ip route flush cache

If I have this right, now all packets from a given IP address will be
sent out the interface to which the IP is bound.
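
If it's easier, the whole thing can be scripted so it survives a reboot
(e.g. called from /etc/rc.local).  A rough sketch for san1, assuming
192.168.101.1 through .8 on eth0 through eth7 and the port_0..port_7
entries already added to /etc/iproute2/rt_tables:

#!/bin/sh
i=0
while [ $i -le 7 ]; do
    ip_addr=192.168.101.$((i + 1))
    echo 1 > /proc/sys/net/ipv4/conf/eth$i/arp_filter
    ip route add 192.168.101.0/24 dev eth$i src $ip_addr table port_$i
    ip rule add from $ip_addr table port_$i
    i=$((i + 1))
done
ip route flush cache
# sanity check: the rules and one of the per-port tables
ip rule show
ip route show table port_0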

Now you need to make these same changes for the two SAN ports on each
Xen box.  Obviously start with one box and then test it before doing the
others.

This should get iscsiadm working and seeing all of the LUNs on all 8
ports on san1, and dm-multipath should work.  If it turns out that
dm-multipath doesn't fan across all 8 remote interfaces, you'll need to
manually set each Xen box to hit a specific pair of ports on san1, two
Xen boxen per pair of san1 ports.  Set it up so the Xen pairs have one
port on each quad port NIC, for redundancy.  It doesn't really make a
difference whether dm-multipath fans over all 8 paths because you have
only 200MB/s per Xen client anyway.  That's 1.6GB/s client bandwidth and
800MB/s server.  So as long as you have port and path redundancy, two
LUN connections per client is as good as 8.  I've actually never seen a
SAN setup with clients logging into more than two head ports.

Most configurations such as this use multiple switches.  So the switch
may still give us problems.  If so we'll have to figure out an
appropriate multiple VLAN setup.  And do all of the above with standard
frame size.  If/when it's working try larger MTU.

> Eventually I pulled the disabled all except one ethernet device on both
> machines, still no luck. 

After so much reconfiguration it's hard to tell what all was going wrong
at this point.

> Finally, out of desperation I pulled the cables
> from both machines, dropped in a direct cable (ie, bypass the nice shiny
> new switch), and discover worked immediately. So I tried with the old
> switch, but same problem, so I've now connected each xen box direct to
> san1 ethernet port, so they now all get a dedicated 1 Gbps port each.

If the source routing config above doesn't immediately work, or if you
get full bandwidth out to the Xen hosts, but only half into san1, you
may need to create 2 isolated VLANs, put two ports of each quad NIC in
each, and one port of each Xen box in each VLAN.

> I think the problem with the switch is that I didn't configure it
> properly to support the 9000 MTU, or something like that, which now
> makes more sense that lots of small packets are fine (not faulty cables,
> network cards, switches, etc) but big packets fail (like the response to
> a DiscoveryAll packet).

You may have simply confused it with all the link plugging and chugging.
 In the past I've seen odd things like switches holding onto a MAC on
port1 ten minutes after I pulled the server from port1 and plugged it
into port10, forcing me to reboot or power cycle the switch to clear the
MAC table.  Other switches handle this with aplomb.  It's been many
years since I've seen that though, and it was a low end model.

> Anyway, all systems are online, and I think I will leave things as is
> for now.

The fact that it's working well enough (and far better than previously),
even if not yet perfected, is the most important part. :)  The client
isn't screaming anymore.

Worth noting is with direct connection you eliminate the switch latency,
increasing throughput.  Though you need to get this all working through
a switch, with both links for redundancy, and so you can expand with
more Xen hosts if needed.  Right now you're out of server ports.  And
you're probably close to exhausting the PCIe slots in san1.

> What I have accomplished:
> 1) All systems should be using dedicated 1Gbps for iSCSI and 1Gbps for
> everything else
> 2) All hardware is physically installed

> What I think I need next time
> 1) 10 x colour coded 2m cables (management/user LAN ports), probably
> blue to match all the rest of the user cabling
> 2) 8 cables in green (port 1 xen)
> 3) 8 cables in yellow (port 2 xen)
> 4) 8 cables in white (4 each for san1/san2 on 1st card)
> 5) 8 cables in grey (4 each for san1/san2 on 2nd card)
> 6) Lotsa cable ties to keep each bundle together

There's your problem.  No orange. ;)

(most LC multimode fiber SAN cables are orange)

> Don't really know what colour cables are available, or even sure if it
> is such a good idea to use so many different colours.... Another option
> would be to stick with two colours, one for the iSCSI SAN network, and
> the second colour for the user LAN. Just makes it hard trying to work
> out which port/machine the other end of this random cable is connected
> to.....

Two colors for SAN: one for Xen boxen, one for servers.  Label each
cable end with its respective switch or host port assignment.  One inch
printer labels work well as they stick to the cable and themselves, so
well that you have to cut them off.  I think somebody sells something fancier
but why bother, as long as you can read your own handwriting.  Label the
Intel NIC ports if they aren't numbered.  That's how I normally do it.

> Anyway, monitoring systems say everything is ok, testing says it's
> working, so I'm off home. No pictures yet, so messy it's embarrassing,

*ALL* racks/closets are messy.  It's only environments where folks are
underworked and overpaid that everything is tidy: govt, uni, big corp.
 Nobody else has time.  And if you're a VAR/consultant paid by the hour,
clients don't give a crap about looks, as long as it works.  They rarely,
if ever, go into the server room, closet, etc. anyway.

> and it isn't even working properly. Hopefully when I'm finished it will
> be worth a picture or two :)

You'll get there before long.  The final configuration may not be
exactly what you envisioned, but I guarantee your overall goals will be
met soon.  You've doubled your bandwidth by isolating user/san traffic
so you're half way there already.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-08 21:42             ` Stan Hoeppner
@ 2013-02-14 22:42               ` Chris Murphy
  2013-02-15  1:10                 ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Chris Murphy @ 2013-02-14 22:42 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org list
  Cc: Adam Goryachev, stan@hardwarefreak.com Hoeppner

(digging back through some things now that the higher priority tasks appear covered)


On Feb 8, 2013, at 2:42 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:

> These tests use 4KB *aligned* IOs.

It seems SSDs now commonly use 8KB pages. [1] [2] At least on OS X with a Samsung 830 SSD, I'm finding a meaningful difference between alignment on 4K vs a 1M alignment. [3]

Sequential write and rewrite aren't affected. Sequential Input is affected, 5.6% improvement by 8K aligning. Random Seeks see an 87% improvement with 1M alignment. I haven't retested to see if an 8K alignment produces as good a result as a 1M alignment. I haven't tested the full effect of Bonnie++ chunk size which is 8KB by default; but in all tests so far there's no meaningful difference between a chunk size of 4KB and 8KB.

It's kind of annoying that SSD manufacturers aren't reporting a "physical sector" mapped to the SSD page size; similar to how 512e AF HDDs report 512 byte logical, 4096 byte physical sectors. The implication of an SSD reporting a 512 byte physical sector is that alignment doesn't matter. I think it might matter.


> If you've partitioned the SSDs, and your partition boundaries fall in
> the middle of erase blocks instead of perfectly between them, then your
> IOs will be unaligned, and performance will suffer.

There doesn't appear to be a way to know which LBAs mark such a boundary. [3] There is also inconsistent understanding of the erase block size; I see 128KB to 2MB erase block sizes published in media, but nothing from manufacturers. So I don't know how we'd know this.

Also, everything I've read indicates that the LBAs in an erase block are not sequential. Only pages have sequential LBAs. LBA 0-7 could be a page on die 4, while LBA 8-15 could be on die 2. The firmware manages the relationship between LBAs and physical pages; similar to how HDD firmware will remap an LBA to a different physical sector in case of bad sectors; except in this case it's SOP to manage wear leveling and the fact that a static mapping would likely mean mapping to physical pages that aren't yet erased (garbage collected). Writes would be significantly negatively impacted by this.

Anyway, I'm skeptical we have sufficient knowledge beyond 4K or 8K alignment. At least on Linux, fortunately, the now common default of LBA 2048 is of course aligned for everything from 0.5K through 1M. But it seems the LBA 40 set by Apple might not be a good idea.



[1]  http://arstechnica.com/information-technology/2012/06/inside-the-ssd-revolution-how-solid-state-disks-really-work/3/
[2]  http://www.anandtech.com/show/4244/intel-ssd-320-review/2

[3]  By that I mean a partition that starts on LBA 40 vs an LBA of 2048. OS X's Disk Utility defaults to using GPT partition scheme with partition 1 starting at LBA 40. Results are repeatable by setting to an LBA that is divisible by 8 sectors, but not divisible by 16; compared to an LBA that's divisible by 2048 sectors. (Sectors defined as a 512 byte sector.)

[4]  "Over time SSDs can get into a fairly fragmented state, with pages distributed randomly all over the LBA range."
http://www.anandtech.com/show/6328/samsung-ssd-840-pro-256gb-review/6



Chris Murphy


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-14 22:42               ` Chris Murphy
@ 2013-02-15  1:10                 ` Adam Goryachev
  2013-02-15  1:40                   ` Chris Murphy
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-15  1:10 UTC (permalink / raw)
  To: Chris Murphy
  Cc: linux-raid@vger.kernel.org list, stan@hardwarefreak.com Hoeppner

On 15/02/13 09:42, Chris Murphy wrote:
> (digging back through some things now that the higher priority tasks
> appear covered)
> 
> 
> On Feb 8, 2013, at 2:42 PM, Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
> 
>> These tests use 4KB *aligned* IOs.
> 
> It seems SSD's commonly now are 8KB paged. [1] [2] At least on OS X
> with a Samsung 830 SSD, I'm finding a meaningful difference between
> alignment on 4K vs a 1M alignment. [3]
> 
> Sequential write and rewrite aren't affected. Sequential Input is
> affected, 5.6% improvement by 8K aligning. Random Seeks see an 87%
> improvement with 1M alignment. I haven't retested to see if an 8K
> alignment produces as good a result as a 1M alignment. I haven't
> tested the full effect of Bonnie++ chunk size which is 8KB by
> default; but in all tests so far there's no meaningful difference
> between a chunk size of 4KB and 8KB.
> 
> It's kind of annoying that SSD manufacturers aren't reporting a
> "physical sector" mapped to the SSD page size; similar to how 512e AF
> HDDs report 512 byte logical, 4096 byte physical sectors. The
> implication of an SSD reporting a 512 byte physical sector is that
> alignment doesn't matter. I think it might matter.

So assuming I don't mind wasting a few MB per disk (I was leaving empty
space at the end of the partition anyway), what would I need to instruct
fdisk and/or md to do to get the alignment right?

Current partition/disk is as follows:
fdisk -l /dev/sdc
Disk /dev/sdc: 480 GB, 480101368320 bytes
255 heads, 63 sectors/track, 58369 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       58000   465884968   fd  Lnx RAID auto

fdisk -ul /dev/sdc:
Disk /dev/sdc: 480 GB, 480101368320 bytes
255 heads, 63 sectors/track, 58369 cylinders, total 937697985 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              63   931769999   465884968   fd  Lnx RAID auto

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-15  1:10                 ` Adam Goryachev
@ 2013-02-15  1:40                   ` Chris Murphy
  2013-02-15  4:01                     ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Chris Murphy @ 2013-02-15  1:40 UTC (permalink / raw)
  To: Adam Goryachev
  Cc: linux-raid@vger.kernel.org list, stan@hardwarefreak.com Hoeppner


On Feb 14, 2013, at 6:10 PM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
> 
> So assuming I don't mind wasting a few MB per disk (I was leaving empty
> space at the end of the partition anyway), what would I need to instruct
> fdisk and/or md to do to get the alignment right?
> 
> Current partition/disk is as follows:
> /dev/sdc1              63   931769999   465884968   fd  Lnx RAID auto

Yeah, the old fdisk starts at LBA 63, which is not good because it's neither 4K nor 8K aligned. For a long time now fdisk has started new partitions at 2048, which is 1MB aligned (and of course also 8KB and 4KB aligned, among others), so it's easiest to just do that. But for 4K and 8K alignment, a start LBA of 48, 64, 80, etc. (any multiple of 16) is also valid.
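
(If you'd rather not fight the old fdisk prompts, parted can do the same thing non-interactively and then check its own work. Just a sketch; the device, the end point and the raid flag are examples, not something I've run against your disks:)

parted -s /dev/sdc rm 1
parted -s /dev/sdc mkpart primary 1MiB 465GB
parted -s /dev/sdc set 1 raid on
parted /dev/sdc align-check optimal 1    # complains if partition 1 isn't optimally aligned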

Chris Murphy


* Re: RAID performance
  2013-02-15  1:40                   ` Chris Murphy
@ 2013-02-15  4:01                     ` Adam Goryachev
  2013-02-15  5:14                       ` Chris Murphy
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-15  4:01 UTC (permalink / raw)
  To: Chris Murphy
  Cc: linux-raid@vger.kernel.org list, stan@hardwarefreak.com Hoeppner

On 15/02/13 12:40, Chris Murphy wrote:
> 
> On Feb 14, 2013, at 6:10 PM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>> 
>> So assuming I don't mind wasting a few MB per disk (I was leaving
>> empty space at the end of the partition anyway), what would I need
>> to instruct fdisk and/or md to do to get the alignment right?
>> 
>> Current partition/disk is as follows: /dev/sdc1              63
>> 931769999   465884968   fd  Lnx RAID auto
> 
> Yeah the old fdisk starts at LBA 63 which is not good because it's
> neither 4K nor 8K aligned. For a long time now fdisk starts new ones
> at 2048 which is 1MB aligned (and is of course also 8KB and 4KB
> aligned, among others so it's just easier to do that.) But for 4K and
> 8K alignment, start LBA 48, 64, 80, etc are also valid.

Probably a stupid question, but how do I force fdisk to start at 64? Or
even 2048?

Would it be a sequence like this:
fdisk /dev/sdb
d <- delete the existing partition
u <- change units
n <- new partition
p <- primary
1 <- partition 1
64 <- start sector 64
xxx <- end size of partition

Will that make it right?

I'm planning on doing:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
fdisk (delete and re-create partition, same size, different start/end)
mdadm --manage /dev/md0 --add /dev/sdb1
Wait for resync to complete, and repeat with the next disk.

I was thinking to also use smartctl to instruct the disk to do a format
before creating the new partition. Would that help to "trim" the disk,
or not make any difference at all (since md will re-write the whole disk
in the resync phase anyway)?

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-15  4:01                     ` Adam Goryachev
@ 2013-02-15  5:14                       ` Chris Murphy
  2013-02-15 11:10                         ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Chris Murphy @ 2013-02-15  5:14 UTC (permalink / raw)
  To: Adam Goryachev
  Cc: linux-raid@vger.kernel.org list, stan@hardwarefreak.com Hoeppner


On Feb 14, 2013, at 9:01 PM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:

> Would it be a sequence like this:
> fdisk /dev/sdb
> d <- delete the existing partition
> u <- change units
> n <- new partition
> p <- primary
> 1 <- partition 1
> 64 <- start sector 64
> xxx <- end size of partition
> 
> Will that make it right?

Yes.
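
As for pre-erasing the drive: smartctl won't do that for you (it reads SMART data and runs self-tests, it doesn't issue erase commands). The usual options are a whole-device TRIM or an ATA Secure Erase while the disk is out of the array. A rough sketch only; both destroy the drive's contents, blkdiscard needs a reasonably recent util-linux, and the drive must not be security-frozen for the hdparm route:

blkdiscard /dev/sdb
# or, the heavier hammer:
hdparm --user-master u --security-set-pass p /dev/sdb
hdparm --user-master u --security-erase p /dev/sdb

Either way md will rewrite everything during the resync, so I wouldn't expect a big difference.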

Chris Murphy


* Re: RAID performance
  2013-02-15  5:14                       ` Chris Murphy
@ 2013-02-15 11:10                         ` Adam Goryachev
  2013-02-15 23:01                           ` Chris Murphy
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-15 11:10 UTC (permalink / raw)
  To: Chris Murphy
  Cc: linux-raid@vger.kernel.org list, stan@hardwarefreak.com Hoeppner

On 15/02/13 16:14, Chris Murphy wrote:
> 
> On Feb 14, 2013, at 9:01 PM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
> 
>> Would it be a sequence like this:
>> fdisk /dev/sdb
>> d <- delete the existing partition
>> u <- change units
>> n <- new partition
>> p <- primary
>> 1 <- partition 1
>> 64 <- start sector 64
>> xxx <- end size of partition
>>
>> Will that make it right?
> 
> Yes.

OK, so I've started this process, with some unexpected results...
First, this is how the partition looks now:
Disk /dev/sdb: 480 GB, 480101368320 bytes
255 heads, 63 sectors/track, 58369 cylinders, total 937697985 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              64   931770000   465893001   fd  Lnx RAID auto
Warning: Partition 1 does not end on cylinder boundary.

I'm not sure why I get that warning, or if it should worry me... I
suppose I can always extend it a bit bigger if there is any problem with
this?

Initially, I made sure the secondary san was in sync with DRBD, and all
users were logged off the system. I was getting a max of around 50MB/sec
from the RAID resync.

So I shutdown all the windows machines, and this went up to a max of
150MB/sec.

Finally, I stopped DRBD on both the secondary and the primary, so now
the RAID device is completely unused, and it is topping out at 213M/sec...

Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdb1[6] sdc1[0] sde1[4] sdf1[5] sdd1[3]
      1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [U_UUU]
      [=============>.......]  recovery = 68.4% (318672880/465883776) finish=12.3min speed=198212K/sec
      bitmap: 3/4 pages [12KB], 65536KB chunk


It was topping out at 200, but I adjusted
/proc/sys/dev/raid/speed_limit_max to 400000
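
(For reference, those are the md resync throttle knobs; the min value below is only an example, I only raised the max:)

cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
echo 400000 > /proc/sys/dev/raid/speed_limit_max   # ceiling, in KB/s
echo  50000 > /proc/sys/dev/raid/speed_limit_min   # floor md tries to hold even under load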

top shows this:
top - 22:06:41 up 1 day, 17:22,  3 users,  load average: 1.08, 1.07, 1.06
Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.7%sy,  0.0%ni, 99.1%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   7903292k total,  1370132k used,  6533160k free,   131796k buffers
Swap:  3939320k total,        0k used,  3939320k free,   939728k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  425 root      20   0     0    0    0 S   26  0.0  20:27.27 md1_raid5
26236 root      20   0     0    0    0 R   17  0.0   4:22.30 md1_resync
   27 root      20   0     0    0    0 S    0  0.0   7:17.68 events/0

also vmstat 5 shows this...
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 6532916 131820 939744    0    0   410   121   32    6  0  0 99  0
 0  0      0 6533512 131824 939744    0    0     0    13 28280 28796  0  1 99  0
 1  0      0 6533300 131832 939744    0    0     0    13 25842 26591  0  1 99  0
 1  0      0 6533864 131836 939748    0    0     0     8 30910 31189  0  1 99  0

So it seems CPU is idle, but I'm curious why I don't see somewhat higher
write speeds... I thought I should see something close to 300 or
400MB/sec, or was I just plain wrong?

Just a reminder, these are the Intel 320 series 480G SSD's.

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-14 12:22                                 ` Stan Hoeppner
@ 2013-02-15 13:31                                   ` Stan Hoeppner
  2013-02-15 14:32                                     ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-15 13:31 UTC (permalink / raw)
  To: stan; +Cc: Adam Goryachev, Dave Cundiff, linux-raid

On 2/14/2013 6:22 AM, Stan Hoeppner wrote:

> Then create 8 table entries with names, such as port_0 thru port_7:
> 
> ~$ echo 100 port_0 >> /etc/iproute2/rt_tables
> ......
> ~$ echo 101 port_7 >> /etc/iproute2/rt_tables

Correcting a typo here, this 2nd line above should read:

~$ echo 107 port_7 >> /etc/iproute2/rt_tables


These eight commands result in a routing table like this:

100	port_0
101	port_1
102	port_2
103	port_3
104	port_4
105	port_5
106	port_6
107	port_7
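
Or, to sidestep typos like mine entirely, a short loop produces the same eight entries (a sketch, assuming a bourne-ish shell):

~$ for i in 0 1 2 3 4 5 6 7; do echo "$((100 + i)) port_$i" >> /etc/iproute2/rt_tables; done
~$ tail -n 8 /etc/iproute2/rt_tables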

The commands below this in the previous email populate the table with
the source routing rules.  With arp_filter enabled, what all of this
does is allow each of the 8 interfaces to behave just like 8 individual
hosts on the same subnet would.  And thinking about this for a brief
moment, you realize this should work just fine on a single switch
without any special switch configuration.  arp_filter docs tell us:

arp_filter - BOOLEAN
	1 - Allows you to have multiple network interfaces on the same
	subnet, and have the ARPs for each interface be answered
	based on whether or not the kernel would route a packet from
	the ARP'd IP out that interface (therefore you must use source
	based routing for this to work). In other words it allows
	control of which cards (usually 1) will respond to an arp
	request.


	0 - (default) The kernel can respond to arp requests with
	addresses from other interfaces. This may seem wrong but it
	usually makes sense, because it increases the chance of
	successful communication.  IP addresses are owned by the
	complete host on Linux, not by particular interfaces. Only for
	more complex setups like load-balancing, does this behaviour
	cause problems.

	arp_filter for the interface will be enabled if at least one of
	conf/{all,interface}/arp_filter is set to TRUE,
	it will be disabled otherwise

As you have other interfaces on the user subnet, we're enabling this
only for the SAN subnet, on a per interface basis, otherwise it would
cause problems with the user subnet interfaces.

So now all SAN subnet traffic from a given interface is properly sent
from that interface.  With your previous arp tweaks it seems each
interface was responding to arps, but TCP packets were still all going
out a single interface.  This configuration fixes that.
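
Concretely, that's just a sysctl per SAN-facing interface on the server and on each Xen client; eth2/eth3 below are placeholders for whatever your interfaces are actually called:

~$ sysctl -w net.ipv4.conf.eth2.arp_filter=1
~$ sysctl -w net.ipv4.conf.eth3.arp_filter=1

plus the matching lines in /etc/sysctl.conf so it survives a reboot:

net.ipv4.conf.eth2.arp_filter = 1
net.ipv4.conf.eth3.arp_filter = 1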

** IMPORTANT **
All of the work you've done with iscsiadm to this point has been with
clients having a single iSCSI ethernet port and single server target
port, and everything "just worked" without specifying local and target
addresses (BTW, don't use the server hostname for any of these
operations, obviously, only the IP addresses as they won't map).  Since
you will now have two local iSCSI addresses and potentially 8 target
addresses, discovery and possibly operations should probably be done on
a 1:1 port basis to make sure both client ports are working and both are
logging into the correct remote ports and mapping the correct LUNs.
Executing the same shell command 128 times across 8 hosts, changing
source and port IP addresses each time, seems susceptible to input
errors.  Two per host less so.
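
For the record, the 1:1 form I have in mind is roughly the following per client port; iface_eth2 and the portal address are placeholders, and your open-iscsi version may word things slightly differently:

~$ iscsiadm -m iface -I iface_eth2 --op new
~$ iscsiadm -m iface -I iface_eth2 --op update -n iface.net_ifacename -v eth2
~$ iscsiadm -m discovery -t sendtargets -p 192.168.101.1 -I iface_eth2
~$ iscsiadm -m node -p 192.168.101.1 -I iface_eth2 --login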

On paper, if multipath will fan all 8 remote ports from each client
port, theoretically you could getter better utilization in some client
access pattern scenarios.  But in real world use, you won't see a
difference.  Given the complexity of trying to use all 8 server ports
per client port, if this was my network, I'd do it like this,
conceptually:  http://www.hardwarefreak.com/lun-mapping.png
Going the "all 8" route you'd add another 112 lines to that diagram atop
the current 16.  That seems a little "busy" and unnecessary, more
difficult to troubleshoot.

Yes, I originally suggested fanning across all 8 ports, but after
weighing the marginal potential benefit against the many negatives, it's
clear to me that it's not the way to go.

So during your next trip to the client, once you have all of your new
cables and ties, it should be relatively quick to set this up.  Going
the "all 8" route maybe not so quick.

-- 
Stan



* Re: RAID performance
  2013-02-15 13:31                                   ` Stan Hoeppner
@ 2013-02-15 14:32                                     ` Adam Goryachev
  2013-02-16  1:07                                       ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-15 14:32 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Stan Hoeppner <stan@hardwarefreak.com> wrote:

>On 2/14/2013 6:22 AM, Stan Hoeppner wrote:
>
>> Then create 8 table entries with names, such as port_0 thru port_7:
>> 
>> ~$ echo 100 port_0 >> /etc/iproute2/rt_tables
>> ......
>> ~$ echo 101 port_7 >> /etc/iproute2/rt_tables
>
>Correcting a typo here, this 2nd line above should read:
>
>~$ echo 107 port_7 >> /etc/iproute2/rt_tables

Thank you for that. I wasn't too sure I trusted your suggestion, so I did go and do some research, including reading the sysctl information you pasted, and it sounded correct... so it should be good :)

>** IMPORTANT **
>All of the work you've done with iscsiadm to this point has been with
>clients having a single iSCSI ethernet port and single server target
>port, and everything "just worked" without specifying local and target
>addresses (BTW, don't use the server hostname for any of these
>operations, obviously, only the IP addresses as they won't map).  Since
>you will now have two local iSCSI addresses and potentially 8 target
>addresses, discovery and possibly operations should probably be done on
>a 1:1 port basis to make sure both client ports are working and both
>are
>logging into the correct remote ports and mapping the correct LUNs.
>Executing the same shell command 128 times across 8 hosts, changing
>source and port IP addresses each time, seems susceptible to input
>errors.  Two per host less so.

Hmmm, 8 SAN IP's x 2 interfaces x 8 machines is a total of 128, or only 16 times on each machine. Personally, it sounds like the perfect case of scripting :)

However, another downside is that if I add another 8 IP's on the secondary san, I have 16 SAN IP's x 2 interfaces x 8 machines, or 256 entries. However, I think linux MPIO has a max of 8 paths anyway, so I was going to have to cull this down I suspect.

>On paper, if multipath will fan all 8 remote ports from each client
>port, theoretically you could getter better utilization in some client
>access pattern scenarios.  But in real world use, you won't see a
>difference.  Given the complexity of trying to use all 8 server ports
>per client port, if this was my network, I'd do it like this,
>conceptually:  http://www.hardwarefreak.com/lun-mapping.png
>Going the "all 8" route you'd add another 112 lines to that diagram
>atop
>the current 16.  That seems a little "busy" and unnecessary, more
>difficult to troubleshoot.

The downside to your suggestion is that if machines 1 and 5 are both busy at the same time, they only get 1Gbps each. Keep the vertical paths as is, but change the second path to an offset of only 1 (2 or 3 would also work, just not 4); then no pair of hosts shares both ports, so two busy machines can still get 1.5Gbps each....

>Yes, I originally suggested fanning across all 8 ports, but after
>weighing the marginal potential benefit against the many negatives,
>it's clear to me that it's not the way to go.
>
>So during your next trip to the client, once you have all of your new
>cables and ties, it should be relatively quick to set this up.  Going
>the "all 8" route maybe not so quick.

I'm still considering the option of configuring the SAN server with two groups of 4 ports in a balance-alb bond; then the clients only need MPIO from two ports to two SAN IPs, or 4 paths each, and the bond will manage the traffic balancing at the SAN server side across any two ports..... I can even lose the source-based routing if I use different subnets and different VLANs, and ignore the arp issues all around. I think that and your solution above are mostly equal, but I'll try your suggestion first; if I get stuck, this will be my fallback plan....

I really need to get this done by Monday after today (one terminal server was offline for 30 minutes, you would think that was the end of the world....I thought 30 minutes was pretty good, since it took 20 minutes for them to tell me........).

So, I'll let you know how it goes, and hopefully show off some flashy pictures to boot, and then next week will be the real test from the users...

PS: the person who runs the once-a-week backup process advised that the backup on Thursday night was 10% faster than normal... So that was a promising sign.


Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-15 11:10                         ` Adam Goryachev
@ 2013-02-15 23:01                           ` Chris Murphy
  0 siblings, 0 replies; 131+ messages in thread
From: Chris Murphy @ 2013-02-15 23:01 UTC (permalink / raw)
  To: Adam Goryachev
  Cc: linux-raid@vger.kernel.org list, stan@hardwarefreak.com Hoeppner


On Feb 15, 2013, at 4:10 AM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
> 
> Warning: Partition 1 does not end on cylinder boundary.
> 
> I'm not sure why I get that warning, or if it should worry me... I
> suppose I can always extend it a bit bigger if there is any problem with
> this?

Because that fdisk is old and still thinking in the era of cylinder/head/sector addressing; that doesn't matter anymore. Everything is accessed by LBA these days.


> it is topping out at 213M/sec…

For reference, I have a 2011 Macbook Pro (Core i7 2820QM) with SATA Rev 3.0, and a Samsung 830 SSD (the baby brother to the 840 Pros you have) and I consistently get 463 MB/s reads and 417 MB/s writes (with Bonnie++). That's one SSD, not RAID or anything. And it's a laptop.

> Just a reminder, these are the Intel 320 series 480G SSD's.

Hmm, I guess I'm confused.


Chris Murphy


* Re: RAID performance
  2013-02-15 14:32                                     ` Adam Goryachev
@ 2013-02-16  1:07                                       ` Stan Hoeppner
  2013-02-16 17:19                                         ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-16  1:07 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/15/2013 8:32 AM, Adam Goryachev wrote:
> Stan Hoeppner <stan@hardwarefreak.com> wrote:
> 
>> On 2/14/2013 6:22 AM, Stan Hoeppner wrote:
>>
>>> Then create 8 table entries with names, such as port_0 thru port_7:
>>>
>>> ~$ echo 100 port_0 >> /etc/iproute2/rt_tables
>>> ......
>>> ~$ echo 101 port_7 >> /etc/iproute2/rt_tables
>>
>> Correcting a typo here, this 2nd line above should read:
>>
>> ~$ echo 107 port_7 >> /etc/iproute2/rt_tables
> 
> Thank you for that, I wasn't too sure I trusted your suggestion so I did go and do some research including reading the sysctl information you pasted, and it sounded correct... so, should be good :)

Even if you're conversing with Linus himself, it's always good to
independently verify everything coming from a list, precisely due to
things as mundane as typos, let alone things that are factually
incorrect for one reason or another.  I've been guilty of both over the
years, and I'd guess I have a lot of company. ;)  Most people don't
intend to make such mistakes, but we all do, on occasion.

>> ** IMPORTANT **
>> All of the work you've done with iscsiadm to this point has been with
>> clients having a single iSCSI ethernet port and single server target
>> port, and everything "just worked" without specifying local and target
>> addresses (BTW, don't use the server hostname for any of these
>> operations, obviously, only the IP addresses as they won't map).  Since
>> you will now have two local iSCSI addresses and potentially 8 target
>> addresses, discovery and possibly operations should probably be done on
>> a 1:1 port basis to make sure both client ports are working and both
>> are
>> logging into the correct remote ports and mapping the correct LUNs.
>> Executing the same shell command 128 times across 8 hosts, changing
>> source and port IP addresses each time, seems susceptible to input
>> errors.  Two per host less so.
> 
> Hmmm, 8 SAN IP's x 2 interfaces x 8 machines is a total of 128, or only 16 times on each machine. Personally, it sounds like the perfect case of scripting :)

Look around.  Someone may have already written one.

> However, another downside is that if I add another 8 IP's on the secondary san, I have 16 SAN IP's x 2 interfaces x 8 machines, or 256 entries. However, I think linux MPIO has a max of 8 paths anyway, so I was going to have to cull this down I suspect.
> 
>> On paper, if multipath will fan all 8 remote ports from each client
>> port, theoretically you could getter better utilization in some client
>> access pattern scenarios.  But in real world use, you won't see a
>> difference.  Given the complexity of trying to use all 8 server ports
>> per client port, if this was my network, I'd do it like this,
>> conceptually:  http://www.hardwarefreak.com/lun-mapping.png
>> Going the "all 8" route you'd add another 112 lines to that diagram
>> atop
>> the current 16.  That seems a little "busy" and unnecessary, more
>> difficult to troubleshoot.
> 
> The downside to your suggestion is that if machine 1 and 5 are both busy at the same time, they only get 1Gbps each. Keep the vertical paths as is, but change the second path to an offset of only 1 (or 2 or 3 would work, just not 4), then there are no pair of hosts sharing both ports, so two machines busy can still get 1.5Gbps.... 

Just make sure you only "flip" four on one side of the diagram, so to
speak.  Each Xen client should have a path to LUNs on each quad NIC for
redundancy in case of server NIC failure.  Also note that multipath uses
the round-robin path selector by default.  So if you do what you mention
here you'll gain little or nothing from the potential positive ~50%
bandwidth asymmetry in this 2 competing server situation.  To gain
something you'd need to use the service-time path selector.  See
multipath.conf for details.  Either way you may want to drop the value
of rr_min_io down from the common default of 1000 so path selection is
more frequent.  If you end up using the default round robin path
selector, you'll want a low value for more balanced link utilization.
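
A multipath.conf fragment along these lines is what I mean; the numbers are illustrative rather than tuned, and you'd reload with "multipath -r" or a multipathd restart afterwards:

defaults {
        path_selector   "service-time 0"
        rr_min_io       16
}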

>> Yes, I originally suggested fanning across all 8 ports, but after
>> weighing the marginal potential benefit against the many negatives,
>> it's clear to me that it's not the way to go.
>>
>> So during your next trip to the client, once you have all of your new
>> cables and ties, it should be relatively quick to set this up.  Going
>> the "all 8" route maybe not so quick.
> 
> I'm still considering the option of configuring the SAN server with two groups of 4 ports in a balance-alb bond, then the clients only need MPIO from two ports to two SAN IP's, or 4 paths each, plus the bond will manage the traffic balancing at the SAN server side across any two ports..... I can even lose the source based routing if I use different subnets and different VLAN's, and ignore the arp issues all around. I think that and your solution above are mostly equal, but I'll try the suggestion above first, if I get stuck, this would be my fallback plan.... 

I suggested the source based routing setup because it allows for a
single SAN subnet, which IMHO is cleaner, easier to manage,
troubleshoot, etc, than 8 different subnets, but while yielding full
link performance using multipath, same as the 8 subnets setup would.
And assuming you're already using a single subnet, it should be as easy
or easier to configure, as you simply have to create the routing table
on the server(s) and each Xen client, and enable arp_filter.

The solution you mention above seems conceptually easier and a bit
cleaner than the 8 subnet setup, but it sacrifices some performance as
balance-alb will not scale as well as multipath iSCSI using individual
interfaces.  But given your peak SAN IO load isn't much more than
100MB/s, it probably makes no difference which configuration you decide
to go with.

There is a silver lining in the single subnet model, now that I think
more about this.  It allows you to try the 8 port multipath fanning
option, and if that doesn't work, or work well enough, you can simply
fall back to the industry standard 2:2 configuration I mentioned most
recently.  Since you'll be testing on just one Xen host, simply remove
the appropriate 7 of 8 LUN logins on each of the two Xen client iSCSI
interfaces (and restart multipathd), leaving just two, one to a port on
each server NIC.  Test this configuration, which should work without
issue as most are setup this way.  Then simply enable arp_filter on the
other Xen boxen and populate their rt_tables.  The remaining steps of
iscsiadmin and multipath setup are common to any of the setups.  Using
this single subnet method requires no special switch configuration, no
bonding, no VLANs, zip.  It's pretty much just like having 24 hosts on the
same subnet, sans individual hostnames.

> I really need to get this done by Monday after today (one terminal server was offline for 30 minutes, you would think that was the end of the world....I thought 30 minutes was pretty good, since it took 20 minutes for them to tell me........).

The auto failover and load balancing of Citrix is pretty nice in such
circumstances.

> So, I'll let you know how it goes, and hopefully show off some flashy pictures to boot, and then next week will be the real test from the users...

If your supplier has some purple and/or hot pink patch cables, go for
it.  Dare to be different, and show you're cool. ;)  Truthfully, they
just stand out better with digital cams.

> PS, the once a week backup process person advised that the backup on thursday night was 10% faster than normal... So that was a promising sign.

Is this a LAN based backup through the DC file server, or SAN based
backup?  Either way, there are probably some cheap ways to double its
throughput now with all the extra SAN b/w soon to be available. :)

-- 
Stan



* Re: RAID performance - *Slow SSDs likely solved*
  2013-02-07 10:05   ` Adam Goryachev
@ 2013-02-16  4:33     ` Stan Hoeppner
       [not found]       ` <cfefe7a6-a13f-413c-9e3d-e061c68dc01b@email.android.com>
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-16  4:33 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On 2/7/2013 4:05 AM, Adam Goryachev wrote:

> Storage server:
...
> 5 x Intel 520s MLC 480G SATA3
> Debian Stable .. *kernel 2.6.32-5-amd64*

There was a regression in 2.6.32 kernels that hammered SSD performance.
 Upgrade to linux-image-3.2.0-0.bpo.3-amd64 and it should fix it.  See:

http://www.archivum.info/linux-ide@vger.kernel.org/2010-02/00243/bad-performance-with-SSD-since-kernel-version-2.6.32.html

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=642729

It was reverted in 2.6.33 upstream, and supposedly reverted in 2.6.32
upstream stable.  But apparently, according to Ben Hutchings' last
comment in the Debian bug report, it hasn't been fixed yet in Debian
2.6.32, and that was over a year ago.
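
Assuming the stock squeeze-backports apt source (adjust to your local mirror), pulling it in is just:

~$ echo 'deb http://backports.debian.org/debian-backports squeeze-backports main' >> /etc/apt/sources.list
~$ apt-get update
~$ apt-get -t squeeze-backports install linux-image-3.2.0-0.bpo.3-amd64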

BTW, did you get the LSI card into the server yet, and if so have you
run any FIO tests to compare with the previous onboard SATA2 numbers you
posted?  Would be interesting to see what performance advantage, if any,
it has over the onboard SATA2, and would also allow us to see just how
much difference the new kernel makes.

Also, which LSI board did you get, the 9211 or 9307?  You simply said
you got the "recommended" one, or something like that.

-- 
Stan



* Re: RAID performance
  2013-02-16  1:07                                       ` Stan Hoeppner
@ 2013-02-16 17:19                                         ` Adam Goryachev
  2013-02-17  1:42                                           ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-16 17:19 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Stan Hoeppner <stan@hardwarefreak.com> wrote:

>On 2/15/2013 8:32 AM, Adam Goryachev wrote:
>> Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> 
>>> On 2/14/2013 6:22 AM, Stan Hoeppner wrote:
>>>
>>>> Then create 8 table entries with names, such as port_0 thru port_7:
>>>>
>>>> ~$ echo 100 port_0 >> /etc/iproute2/rt_tables
>>>> ......
>>>> ~$ echo 101 port_7 >> /etc/iproute2/rt_tables
>>>
>>> Correcting a typo here, this 2nd line above should read:
>>>
>>> ~$ echo 107 port_7 >> /etc/iproute2/rt_tables

>Even if you're conversing with Linus himself, it's always good to
>independently verify everything coming from a list, precisely due to
>things as mundane as typos, let alone things that are factually
>incorrect for one reason or another.  I've been guilty of both over the
>years, and I'd guess I have a lot of company. ;)  Most people don't
>intend to make such mistakes, but we all do, on occasion.

Sure, most people intend the best, but at the end of the day, I'm the one that will get yelled at if I get it wrong :)

>>> ** IMPORTANT **
>>> All of the work you've done with iscsiadm to this point has been
>>> with clients having a single iSCSI ethernet port and single server
>>> target port, and everything "just worked" without specifying local
>>> and target addresses (BTW, don't use the server hostname for any
>>> of these operations, obviously, only the IP addresses as they won't
>>> map). Since you will now have two local iSCSI addresses and
>>> potentially 8 target addresses, discovery and possibly operations 
>>> should probably be done on a 1:1 port basis to make sure both
>>> client ports are working and both are
>>> logging into the correct remote ports and mapping the correct LUNs.
>>> Executing the same shell command 128 times across 8 hosts, changing
>>> source and port IP addresses each time, seems susceptible to input
>>> errors.  Two per host less so.
>> Hmmm, 8 SAN IP's x 2 interfaces x 8 machines is a total of 128, or
>only 16 times on each machine. Personally, it sounds like the perfect
>case of scripting :)
>Look around.  Someone may have already written one.

OK, I don't think any of this is going to work properly... I have 11 targets at the moment, so with two interfaces on the xen box, and 2 ip's on the san, it is going to have 4 paths per target. So I need 44 paths, but after 32 it times out all the rest of them when doing a session login. I don't see how to reduce this any further without going back to the old 1Gbps maximum performance level (and still use MPIO). I'd have to limit which targets can be seen so only a max of 8 can be seen by any host. This will only get worse if I manage to get it all working and then add more VM's to the system, I could easily end up with 20 targets.

>> However, another downside is that if I add another 8 IP's on the
>secondary san, I have 16 SAN IP's x 2 interfaces x 8 machines, or 256
>entries. However, I think linux MPIO has a max of 8 paths anyway, so I
>was going to have to cull this down I suspect.

Well, I didn't even get to this... actually, exposing all 8 IP's on san1 produced 16 paths per Target, and I did see problems trying to get that working, which is why I dropped down to the 4 paths above.

>>> On paper, if multipath will fan all 8 remote ports from each client
>>> port, theoretically you could get better utilization in some client
>>> access pattern scenarios.  But in real world use, you won't see a
>>> difference.  Given the complexity of trying to use all 8 server
>>> ports per client port, if this was my network, I'd do it like this,
>>> conceptually:  http://www.hardwarefreak.com/lun-mapping.png
>>> Going the "all 8" route you'd add another 112 lines to that diagram
>>> atop the current 16.  That seems a little "busy" and unnecessary, more
>>> difficult to troubleshoot.
>> 
>> The downside to your suggestion is that if machine 1 and 5 are both
>busy at the same time, they only get 1Gbps each. Keep the vertical
>paths as is, but change the second path to an offset of only 1 (or 2 or
>3 would work, just not 4), then there are no pair of hosts sharing both
>ports, so two machines busy can still get 1.5Gbps.... 
>
>Just make sure you only "flip" four on one side of the diagram, so to
>speak.  Each Xen client should have a path to LUNs on each quad NIC for
>redundancy in case of server NIC failure.  Also note that multipath
>uses
>the round-robin path selector by default.  So if you do what you
>mention
>here you'll gain little or nothing from the potential positive ~50%
>bandwidth asymmetry in this 2 competing server situation.  To gain
>something you'd need to use the service-time path selector.  See
>multipath.conf for details.  Either way you may want to drop the value
>of rr_min_io down from the common default of 1000 so path selection is
>more frequent.  If you end up using the default round robin path
>selector, you'll want a low value for more balanced link utilization.
>
>>> Yes, I originally suggested fanning across all 8 ports, but after
>>> weighing the marginal potential benefit against the many negatives,
>>> it's clear to me that it's not the way to go.
>>>
>>> So during your next trip to the client, once you have all of your
>>> new cables and ties, it should be relatively quick to set this up. 
>>> Going the "all 8" route maybe not so quick.
>> I'm still considering the option of configuring the SAN server with
>two groups of 4 ports in a balance-alb bond, then the clients only need
>MPIO from two ports to two SAN IP's, or 4 paths each, plus the bond
>will manage the traffic balancing at the SAN server side across any two
>ports..... I can even lose the source based routing if I use different
>subnets and different VLAN's, and ignore the arp issues all around. I
>think that and your solution above are mostly equal, but I'll try the
>suggestion above first, if I get stuck, this would be my fallback
>plan.... 

So it seems even this won't work, because I will still have 4 paths per target... Which brings me back to square one....

I need both xen ports in a single bond and each group of 4 ports on san1 in a bond; this gives 2 paths per target (or the san could be 8 ports in one bond, with xen using its two interfaces individually), and then I can get up to 16 targets, which at least lets me get things working now and potentially scales a little bit further.

Maybe it is unusual for people to use so many targets, or something... I can't seem to find anything on Google about this limit, which seems to be pretty low :(

>I suggested the source based routing setup because it allows for a
>single SAN subnet, which IMHO is cleaner, easier to manage,
>troubleshoot, etc, than 8 different subnets, but while yielding full
>link performance using multipath, same as the 8 subnets setup would.
>And assuming you're already using a single subnet, it should be as easy
>or easier to configure, as you simply have to create the routing table
>on the server(s) and each Xen client, and enable arp_filter.
>
>The solution you mention above seems conceptually easier and a bit
>cleaner that the 8 subnet setup, but it sacrifices some performance as
>balance-alb will not scale as well as multipath iSCSI using individual
>interfaces.  But given your peak SAN IO load isn't much more than
>100MB/s, it probably makes no difference which configuration you decide
>to go with.

I don't understand this.... MPIO to all 8 ports would have scaled the best, I think, since it would balance all traffic for all xen boxes equally over all interfaces at both the san and xen side.

However, using the 4 path per target method will limit performance depending on who those paths are shared with. Using balance-alb will allow linux to automatically assign 2 different interfaces for each client, and in theory, support full 8Gbps for ANY 4 clients by dynamically allocating the ports instead of some arbitrary static configuration.

What am I missing or not seeing? I'm sure I'm blinded by having tried so many different things now...

>There is a silver lining in the single subnet model, now that I think
>more about this.  It allows you to try the 8 port multipath fanning
>option, and if that doesn't work, or work well enough, you can simply
>fall back to the industry standard 2:2 configuration I mentioned most
>recently.  Since you'll be testing on just one Xen host, simply remove
>the appropriate 7 of 8 LUN logins on each of the two Xen client iSCSI
>interfaces (and restart multipathd), leaving just two, one to a port on
>each server NIC.  Test this configuration, which should work without
>issue as most are setup this way.  Then simply enable arp_filter on the
>other Xen boxen and populate their rt_tables.  The remaining steps of
>iscsiadmin and multipath setup are common to any of the setups.  Using
>this single subnet method requires no special switch configuration, no
>bonding, no VLANs, zip.  It's pretty much just like have 24 hosts on
>the same subnet, sans individual hostnames.

I just don't see why this didn't work for me.... I didn't even find an option to adjust this maximum limit. I only assume it is a limit at this stage...

>> I really need to get this done by Monday after today (one terminal
>server was offline for 30 minutes, you would think that was the end of
>the world....I thought 30 minutes was pretty good, since it took 20
>minutes for them to tell me........).
>The auto failover and load balancing of Citrix is pretty nice in such
>circumstances.

Apparently Citrix is too expensive for them.... One day I may implement some linux load balancing frontend, but lots of other things before I get to messing with something like that.... Like making sure everything works properly regardless of which TS the user logs into....

>> So, I'll let you know how it goes, and hopefully show off some flashy
>pictures to boot, and then next week will be the real test from the
>users...
>
>If your supplier has some purple and/or hot pink patch cables, go for
>it.  Dare to be different, and show you're cool. ;)  Truthfully, they
>just stand out better with digital cams.

Nope, I just got blue and yellow.... I think they had green, red, black, white, grey, etc... but no really exciting colours.

>> PS, the once a week backup process person advised that the backup on
>>thursday night was 10% faster than normal... So that was a promising
>>sign.
>Is this a LAN based backup through the DC file server, or SAN based
>backup?  Either way, there are probably some cheap ways to double its
>throughput now with all the extra SAN b/w soon to be available. :)

Effectively a three-step process:
1. Stop the database, use the DB admin tool to "copy" the database to a new database (lots of insert transactions)
2. Copy the files the new database was saved in to another location on the same disk
3. Zip those files.

So all three steps are doing lots of iSCSI read and write at the same time....

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-16 17:19                                         ` Adam Goryachev
@ 2013-02-17  1:42                                           ` Stan Hoeppner
  2013-02-17  5:02                                             ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-17  1:42 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/16/2013 11:19 AM, Adam Goryachev wrote:

> OK, I don't think any of this is going to work properly... I have 11 targets at the moment, so with two interfaces on the xen box, and 2 ip's on the san, it is going to have 4 paths per target. So I need 44 paths, but after 32 it times out all the rest of them when doing a session login. I don't see how to reduce this any further without going back to the old 1Gbps maximum performance level (and still use MPIO). I'd have to limit which targets can be seen so only a max of 8 can be seen by any host. This will only get worse if I manage to get it all working and then add more VM's to the system, I could easily end up with 20 targets.
...
> Well, I didn't even get to this... actually, exposing all 8 IP's on san1 produced 16 paths per Target, and I did see problems trying to get that working, which is why I dropped down to the 4 paths above.
...
> So it seems even this won't work, because I will still have 4 paths per target... Which brings me back, to square one....
> 
> I need both xen ports in a single bond, each group of 4 ports on san1 in a bond, this provides 2 paths per target (or san could be 8 ports in one bond, and xen could use two interfaces individually), and then I can get up to 16 targets which at least lets me get things working now, and potentially scales a little bit further.
> 
> Maybe it is unusual for people to use so many targets, or something... I can't seem to find anything on google about this limit, which seems to be pretty low :(

I regret suggesting this.  This can work in some scenarios, but it is
simply not needed in this one, whether you could get it to work or not.

> I don't understand this.... MPIO to all 8 ports would have scaled the best

*Theoretically* yes.  But your workload isn't theoretical.  You don't
design a network, user or SAN, based on balancing theoretical maximums
of each channel.  You design it based on actual user workload data
flows.  In reality, 2 ports of b/w in your SAN server is far more than
sufficient for your current workload, and is even sufficient for double
your current workload.  And nobody outside HPC designs for peak user
throughput, but average throughput, unless the budget is unlimited, and
most are not.

> However, using the 4 path per target method will limit performance depending on who those paths are shared with.

See above regarding actual workload data flows.  You're still in
theoretical land.

> Using balance-alb...

Forget using Linux bonding for SAN traffic.  It's a non-starter.

> What am I missing or not seeing? I'm sure I'm blinded by having tried so many different things now...

You're allowing theory to blind you from reality.  You're looking for
something that's perfect instead of more than sufficient.

> I just don't see why this didn't work for me.... I didn't even find an option to adjust this maximum limit. I only assume it is a limit at this stage...

Forget it.  Configure the 2:2 and be done with it.

>>> I really need to get this done by Monday after today

One more reason to go with the standard 2:2 setup.

-- 
Stan



* Re: RAID performance - *Slow SSDs likely solved*
       [not found]       ` <cfefe7a6-a13f-413c-9e3d-e061c68dc01b@email.android.com>
@ 2013-02-17  5:01         ` Stan Hoeppner
  0 siblings, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-17  5:01 UTC (permalink / raw)
  To: Adam Goryachev, Linux RAID

On 2/16/2013 11:27 AM, Adam Goryachev wrote:
> Stan Hoeppner <stan@hardwarefreak.com> wrote:
> 
>> On 2/7/2013 4:05 AM, Adam Goryachev wrote:
>>
>>> Storage server:
>> ...
>>> 5 x Intel 520s MLC 480G SATA3
>>> Debian Stable .. *kernel 2.6.32-5-amd64*
>>
>> There was a regression in 2.6.32 kernels that hammered SSD performance.
>> Upgrade to linux-image-3.2.0-0.bpo.3-amd64 and it should fix it.  See:

Correction, make that:  linux-image-3.2.0-0.bpo.4-amd64  (most current)

>> http://www.archivum.info/linux-ide@vger.kernel.org/2010-02/00243/bad-performance-with-SSD-since-kernel-version-2.6.32.html
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=642729
>>
>> It was reverted in 2.6.33 upstream, and supposedly reverted in 2.6.32
>> upstream stable.  But apparently, according to Ben Hutchings last
>> comment in the Debian bug report, it hasn't been fixed yet in Debian
>> 2.6.32, and that was over a year ago.
> 
> That sounds like the right solution, would seem to make a massive difference...

2-3x greater throughput per SSD.  Yeah, slight improvement. ;)

> However, apparently there is another bug in kernels newer than 2.6.32 which causes iscsi to fail... When I previously upgraded the kernel to 3.2.x to fix an LVM problem I had caused, I noticed iscsi didn't work, so I reverted to 2.6.32 after I finished fixing the LVM issue.

I don't find any issues WRT iscsi in linux-image-3.2.0-0.bpo.4-amd64
(3.2.35-2~bpo60+1)

> I'll have to watch out for that when I get to it.

That seems a bit nonchalant given you started this thread due to low
RAID performance. ;)  Makes sense though, given that the actual cause of
the user complaints was elsewhere-- in the network.

>> BTW, did you get the LSI card into the server yet, and if so have you
>> run any FIO tests to compare with the previous onboard SATA2 numbers
>> you posted?  Would be interesting to see what performance advantage, if
>> any, it has over the onboard SATA2, and would also allow us to see just how
>> much difference the new kernel makes.
> 
> Well, I'm guessing it won't make much difference given the above bug, but yes, I did install the new LSI card at the same time as the extra ethernet ports....

I assure you it made a difference.  I'm curious how much.  Given all the
testing you did early on I figured you'd want to know this as well.

>> Also, which LSI board did you get, the 9211 or 9307?  You simply said
>> you got the "recommended" one, or something like that.
> 
> The 9307-8i, the one that my supplier couldn't find initially, that you said was the best option because it specifically supported the high IOPS that the SSD's were capable of.

I typo'd previously, it's actually 9207.

Excellent.  If you expand to 8 of these Intel 520s it can handle the
combined 400K IOPS easily.  That's one more thing that makes running DRBD
continuously possible.

> I'll have to upgrade the kernel, but will need to be careful, I need to make sure the new kernel will work properly for both DRBD and iscsi ....

The only DRBD issue I see related to linux-image-3.2.0-0.bpo.4-amd64
(3.2.35-2~bpo60+1) is false reporting of high disk utilization.  Not a
show stopper:  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=698450

> Another option would be to manually compile my own kernel and just revert that specific patch myself, but I really dislike doing that, I prefer to use the standard packages.....

Yes, it's more difficult and time consuming to support hand-rolled custom
kernels.  In this case it shouldn't be necessary given you're using high
volume COTS hardware and widely used protocols (iSCSI).

-- 
Stan



* Re: RAID performance
  2013-02-17  1:42                                           ` Stan Hoeppner
@ 2013-02-17  5:02                                             ` Adam Goryachev
  2013-02-17  6:28                                               ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-17  5:02 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Stan Hoeppner <stan@hardwarefreak.com> wrote:

>On 2/16/2013 11:19 AM, Adam Goryachev wrote:
>
>> OK, I don't think any of this is going to work properly... I have 11
>targets at the moment, so with two interfaces on the xen box, and 2
>ip's on the san, it is going to have 4 paths per target. So I need 44
>paths, but after 32 it times out all the rest of them when doing a
>session login. I don't see how to reduce this any further without going
>back to the old 1Gbps maximum performance level (and still use MPIO).
>I'd have to limit which targets can be seen so only a max of 8 can be
>seen by any host. This will only get worse if I manage to get it all
>working and then add more VM's to the system, I could easily end up
>with 20 targets.
>...
>> Well, I didn't even get to this... actually, exposing all 8 IP's on
>san1 produced 16 paths per Target, and I did see problems trying to get
>that working, which is why I dropped down to the 4 paths above.
>...
>> So it seems even this won't work, because I will still have 4 paths
>per target... Which brings me back, to square one....
>> 
>> I need both xen ports in a single bond, each group of 4 ports on san1
>in a bond, this provides 2 paths per target (or san could be 8 ports in
>one bond, and xen could use two interfaces individually), and then I
>can get up to 16 targets which at least lets me get things working now,
>and potentially scales a little bit further.
>> 
>> Maybe it is unusual for people to use so many targets, or
>something... I can't seem to find anything on google about this limit,
>which seems to be pretty low :(
>
>I regret suggesting this.  This can work in some scenarios, but it is
>simply not needed in this one, whether you could get it to work or not.
>
>> I don't understand this.... MPIO to all 8 ports would have scaled the
>best
>
>*Theoretically* yes.  But your workload isn't theoretical.  You don't
>design a network, user or SAN, based on balancing theoretical maximums
>of each channel.  You design it based on actual user workload data
>flows.  In reality, 2 ports of b/w in your SAN server is far more than
>sufficient for your current workload, and is even sufficient for double
>your current workload.  And nobody outside HPC designs for peak user
>throughput, but average throughput, unless the budget is unlimited, and
>most are not.
>
>> However, using the 4 path per target method will limit performance
>depending on who those paths are shared with.
>
>See above regarding actual workload data flows.  You're still in
>theoretical land.
>
>> Using balance-alb...
>
>Forget using Linux bonding for SAN traffic.  It's a non starter.
>
>> What am I missing or not seeing? I'm sure I'm blinded by having tried
>so many different things now...
>
>You're allowing theory to blind you from reality.  You're looking for
>something that's perfect instead of more than sufficient.
>
>> I just don't see why this didn't work for me.... I didn't even find
>an option to adjust this maximum limit. I only assume it is a limit at
>this stage...
>
>Forget it.  Configure the 2:2 and be done with it.
>
>>>> I really need to get this done by Monday after today
>
>One more reason to go with the standard 2:2 setup.

That's the problem, even the 2:2 setup doesn't work.

Two ethernet interfaces on the xen client x 2 IP's on the san server equals 4 paths, times 11 targets equals 44 paths total, and the linux iscsi-target (ietd) only supports a maximum of 32 on the version I'm using. I did actually find the details of this limit:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=687619

As much as I like Debian stable, it is really annoying to keep finding that you are affected so severely by known bugs that have been known for over a year (snip whinging).

So I've currently left it with 8 ports in bond0 using balance-alb, and each client using MPIO with 2 interfaces to each target (22 paths in total). I ran a quick dd read test from each client simultaneously; the minimum read speed was 98MB/s, and with a single client the max speed was around 180MB/s.

So, will see how this goes this week, then will try to upgrade the kernel, and also upgrade the iscsi target to fix both bugs and can then change back to MPIO with 4 paths (2:2).

In fact, I suspect a significant part of this entire project's performance issue could be attributed to the kernel bug. The user who reported the issue was getting slower performance from the SSD than from an old HDD, and I'm losing a significant amount of performance because of it (as you said, even 1Gbps should probably be sufficient).

I'll probably test the upgrade to debian testing on the secondary san during the week, then if that is successful, I can repeat the process on the primary.

Regards,
Adam


--
Adam Goryachev
Website Managers
www.websitemanagers.com.au


* Re: RAID performance
  2013-02-17  5:02                                             ` Adam Goryachev
@ 2013-02-17  6:28                                               ` Stan Hoeppner
  2013-02-17  8:41                                                 ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-17  6:28 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/16/2013 11:02 PM, Adam Goryachev wrote:
> Stan Hoeppner <stan@hardwarefreak.com> wrote:

>> One more reason to go with the standard 2:2 setup.
> 
> That's the problem, even the 2:2 setup doesn't work.

You're misunderstanding what I meant by "2:2".  This simply means two
client ports linked to two server ports.  The way this is done properly
is for each initiator interface to only login to the LUNs at one remote
interface.  The result is each client interface only logs into 11 LUNs.
 That's 22 total sessions and puts you under the 32 limit of the 2.6.32
Squeeze kernel.

Correct configuration:

Client              Server
192.168.101.11 ---> 192.168.101.1 LUNs 0,1,2,3,4,5,6,7,8,9,10
192.168.101.12 ---> 192.168.101.2 LUNs 0,1,2,3,4,5,6,7,8,9,10

It sounds like what you're doing is this:

Client              Server
192.168.101.11 ---> 192.168.101.1 LUNs 0,1,2,3,4,5,6,7,8,9,10
192.168.101.11 ---> 192.168.101.2 LUNs 0,1,2,3,4,5,6,7,8,9,10

192.168.101.12 ---> 192.168.101.1 LUNs 0,1,2,3,4,5,6,7,8,9,10
192.168.101.12 ---> 192.168.101.2 LUNs 0,1,2,3,4,5,6,7,8,9,10

Note that the 2nd set of 11 LUN logins from each client interface serves
ZERO purpose.  You gain neither added redundancy nor bandwidth by doing
this.  I mentioned this in a previous email.  Again, all it does is eat
up your available sessions.
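
In iscsiadm terms, the correct configuration boils down to something like this on each client, with iface_eth2/iface_eth3 standing in for however you've named the two bound interfaces:

~$ iscsiadm -m discovery -t sendtargets -p 192.168.101.1 -I iface_eth2
~$ iscsiadm -m node -p 192.168.101.1 -I iface_eth2 --login
~$ iscsiadm -m discovery -t sendtargets -p 192.168.101.2 -I iface_eth3
~$ iscsiadm -m node -p 192.168.101.2 -I iface_eth3 --login

That's 11 sessions per interface, 22 total, and nothing logging into the same LUNs twice.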

> Two ethernet interfaces on the xen client x 2 IP's on the san server equals 4 paths, times 11 targets equals 44 paths total, and the linux iscsi-target (ietd) only supports a maximum of 32 on the version I'm using. I did actually find the details of this limit:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=687619

First, this bug isn't a path issue but a session issue.  Session = LUN
login.  Thus I'd guess you have a different problem.  Posting errors
from logs would be helpful.  That may not even be necessary though,
here's why:

You've told us that in production you have 8 client machines each with
one initiator, the links being port-to-port direct to the server's 8
ports.  You're having each client interface login to 11 LUNs.  That's
*88 sessions* at the target.  This Squeeze "bug" is triggered at 32
sessions.  Thus if your problem was this bug it would have triggered in
production before you started testing w/2 interfaces on this one client box.

Thus, it would seem the problem here is actually that the iscsi-target
code simply doesn't like seeing one initiator attempting to log into the
same 11 LUNs on two different interfaces.

> As much as I like Debian stable, it is really annoying to keep finding that you are affected so severely by known bugs that have been known for over a year (snip whinging).

This is why backports exists.  The latest backport kernel has both of
these fixes, though again, it doesn't appear the iscsi "bug" is
affecting you, but something else.
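
For reference, pulling the bpo kernel in on Squeeze is roughly the
following.  Treat the mirror line and exact package name as a sketch
and adjust for your setup:

echo "deb http://backports.debian.org/debian-backports squeeze-backports main" >> /etc/apt/sources.list
apt-get update
apt-get -t squeeze-backports install linux-image-3.2.0-0.bpo.4-amd64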

> So I've currently left it with 8 x ports in bond0 using balance-alb, and each client using MPIO with 2 interfaces to each target (total 22 paths). I ran a quick dd read test from each client simultaneously, and the minimum read speed was 98MB/s, while a single client's max speed was around 180MB/s.

This makes no sense at all.  First, what does "8 x ports in bond0 using
balance-alb" mean?  And, with 8 client machines that's 176 sessions, not
22.  The Debian Squeeze 2.6.32 bug is due to concurrent sessions at the
iscsi-target exceeding 32.  Here you seem to be telling us you have 176
sessions...

> So, will see how this goes this week, then will try to upgrade the kernel, and also upgrade the iscsi target to fix both bugs and can then change back to MPIO with 4 paths (2:2).
> 
> In fact, I suspect a significant part of this entire project's performance issue could be attributed to the kernel bug. The user who reported the issue was getting slower performance from the SSD compared to an old HDD, and I'm losing a significant amount of performance to it (as you said, even 1Gbps should probably be sufficient).

It seems pretty clear the SSD bug is affecting you.  However it seems
your iSCSI issues are unrelated to the iSCSI "bug".

> I'll probably test the upgrade to debian testing on the secondary san during the week, then if that is successful, I can repeat the process on the primary.

It takes a couple of minutes max to install the BPO kernel on san1.  It
takes about the same to remove the grub boot entry and reboot to the old
kernel if you have problems with it (which is very unlikely).

It seems strange that you'd do a distro upgrade on the backup server
simply to see if a new kernel fixes a problem on the primary.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-17  6:28                                               ` Stan Hoeppner
@ 2013-02-17  8:41                                                 ` Adam Goryachev
  2013-02-17 13:58                                                   ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-17  8:41 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 17/02/13 17:28, Stan Hoeppner wrote:
> On 2/16/2013 11:02 PM, Adam Goryachev wrote:
>> Stan Hoeppner <stan@hardwarefreak.com> wrote:
> 
>>> One more reason to go with the standard 2:2 setup.
>>
>> That's the problem, even the 2:2 setup doesn't work.
> 
> You're misunderstanding what I meant by "2:2".  This simply means two
> client ports linked to two server ports.  The way this is done properly
> is for each initiator interface to only login to the LUNs at one remote
> interface.  The result is each client interface only logs into 11 LUNs.
>  That's 22 total sessions and puts you under the 32 limit of the 2.6.32
> Squeeze kernel.
> 
> Correct configuration:
> 
> Client              Server
> 192.168.101.11 ---> 192.168.101.1 LUNs 0,1,2,3,4,5,6,7,8,9,10
> 192.168.101.12 ---> 192.168.101.2 LUNs 0,1,2,3,4,5,6,7,8,9,10
> 
> It sounds like what you're doing is this:
> 
> Client              Server
> 192.168.101.11 ---> 192.168.101.1 LUNs 0,1,2,3,4,5,6,7,8,9,10
> 192.168.101.11 ---> 192.168.101.2 LUNs 0,1,2,3,4,5,6,7,8,9,10
> 
> 192.168.101.12 ---> 192.168.101.1 LUNs 0,1,2,3,4,5,6,7,8,9,10
> 192.168.101.12 ---> 192.168.101.2 LUNs 0,1,2,3,4,5,6,7,8,9,10
> 
> Note that the 2nd set of 11 LUN logins from each client interface serves
> ZERO purpose.  You gain neither added redundancy nor bandwidth by doing
> this.  I mentioned this in a previous email.  Again, all it does is eat
> up your available sessions.

OK, in that case, you are correct, I've misunderstood you.

I'm unsure how to configure things to work that way...

I've run the following commands from the URL you posted previously:
http://linfrastructure.blogspot.com.au/2008/02/multipath-and-equallogic-iscsi.html

iscsiadm -m iface -I iface0 --op=new
iscsiadm -m iface -I iface1 --op=new
iscsiadm -m iface -I iface0 --op=update -n iface.hwaddress -v 00:16:3E:XX:XX:XX
iscsiadm -m iface -I iface1 --op=update -n iface.hwaddress -v 00:16:3E:XX:XX:XX

iscsiadm -m discovery -t st -p 10.X.X.X

The above command (discovery) finds 4 paths for each LUN, since it
automatically uses each interface to talk to each LUN. Do you know how
to stop that from happening? If I only allow a connection to a single IP
on the SAN, then it will only use one session from each interface.

>> Two ethernet interfaces on the xen client x 2 IP's on the san server equals 4 paths, times 11 targets equals 44 paths total, and the linux iscsi-target (ietd) only supports a maximum of 32 on the version I'm using. I did actually find the details of this limit:
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=687619
> 
> First, this bug isn't a path issue but a session issue.  Session = LUN
> login.  Thus I'd guess you have a different problem.  Posting errors
> from logs would be helpful.  That may not even be necessary though,
> here's why:
> 
> You've told us that in production you have 8 client machines each with
> one initiator, the links being port-to-port direct to the server's 8
> ports.  You're having each client interface login to 11 LUNs.  That's
> *88 sessions* at the target.  This Squeeze "bug" is triggered at 32
> sessions.  Thus if your problem was this bug it would have triggered in
> production before you started testing w/2 interfaces on this one client box.
> 
> Thus, it would seem the problem here is actually that the iscsi-target
> code simply doesn't like seeing one initiator attempting to log into the
> same 11 LUNs on two different interfaces.

No, not quite. See below.

>> As much as I like Debian stable, it is really annoying to keep finding that you are affected so severely by known bugs that have been known for over a year (snip whinging).
> 
> This is why backports exists.  The latest backport kernel has both of
> these fixes, though again, it doesn't appear the iscsi "bug" is
> affecting you, but something else.
> 
>> So I've currently left it with 8 x ports in bond0 using balance-alb, and each client using MPIO with 2 interfaces to each target (total 22 paths). I ran a quick dd read test from each client simultaneously, and the minimum read speed was 98MB/s, while a single client's max speed was around 180MB/s.
> 
> This makes no sense at all.  First, what does "8 x ports in bond0 using
> balance-alb" mean?  And, with 8 client machines that's 176 sessions, not
> 22.  The Debian Squeeze 2.6.32 bug is due to concurrent sessions at the
> iscsi-target exceeding 32.  Here you seem to be telling us you have 176
> sessions...

The iSCSI bug limits the number of sessions that can be set up within a
very short time interval; it isn't a maximum number of sessions. (I
could verify this by disabling the automatic login and manually logging
in to each LUN one by one, 4 sessions at a time.) This is why I could
previously have 11 sessions from each of 8 machines at one time: only
one machine would log in at a time (unless they all booted at exactly
the same instant), and each one would only create 11 sessions. Same with
the current work-around/setup: only 22 sessions per machine, so only 22
being logged into at a time.
See this for a perhaps better explanation of the bug (which sort of
isn't a bug, just a default limitation):
http://blog.wpkg.org/2007/09/09/solving-reliability-and-scalability-problems-with-iscsi/

After more reading, it seems there is still no package with this fix
included: 1.4.20.2-10.1 doesn't include it, and that is the most recent
version. The only solution would be to rebuild the deb source package
with the additional one-line patch, but if I get the above solution
working (only one login from each interface) then I don't need it anyway.

>> So, will see how this goes this week, then will try to upgrade the kernel, and also upgrade the iscsi target to fix both bugs and can then change back to MPIO with 4 paths (2:2).
>>
>> In fact, I suspect a significant part of this entire project's performance issue could be attributed to the kernel bug. The user who reported the issue was getting slower performance from the SSD compared to an old HDD, and I'm losing a significant amount of performance to it (as you said, even 1Gbps should probably be sufficient).
> 
> It seems pretty clear the SSD bug is affecting you.  However it seems
> your iSCSI issues are unrelated to the iSCSI "bug".

Nope, pretty sure the iSCSI bug is the issue... In addition, there is my
inability to work out how to tell iscsiadm to only create one session
from each interface. Solving this usage issue would get me back on track
and side-step the whole iSCSI bug anyway.

>> I'll probably test the upgrade to debian testing on the secondary san during the week, then if that is successful, I can repeat the process on the primary.
> 
> It takes a couple of minutes max to install the BPO kernel on san1.  It
> takes about the same to remove the grub boot entry and reboot to the old
> kernel if you have problems with it (which is very unlikely).
> 
> It seems strange that you'd do a distro upgrade on the backup server
> simply to see if a new kernel fixes a problem on the primary.

I was considering a complete upgrade to Debian testing on the mistaken
assumption that it would include:
1) a newer kernel (it does, of course)
2) a newer iscsitarget (it does, but not new enough)
3) a newer drbd (it doesn't, but I'm already using a self-compiled
version from the upstream stable release anyway).

So, of course, you are right. I will try a remote upgrade now to the
backport kernel; I'll probably need to rebuild the dkms module for iscsi
and rebuild DRBD, none of which should interfere with a remote reboot.
Worst case, it's only a 20 minute drive. This should resolve the SSD
performance issue, and leaves me with just the iscsiadm usage to sort out.

Thanks for your assistance, and patience with me, I appreciate it :)

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-02-08 13:58           ` Adam Goryachev
  2013-02-08 21:42             ` Stan Hoeppner
@ 2013-02-17  9:52             ` Adam Goryachev
  2013-02-18 13:20               ` RAID performance - new kernel results - 5x SSD RAID5 Stan Hoeppner
  2013-02-23 15:57               ` RAID performance - new kernel results John Stoffel
  1 sibling, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-17  9:52 UTC (permalink / raw)
  To: Dave Cundiff; +Cc: linux-raid

On 09/02/13 00:58, Adam Goryachev wrote:
> On 08/02/13 02:32, Dave Cundiff wrote:
>> On Thu, Feb 7, 2013 at 7:49 AM, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>>> I definitely see that. See below for a FIO run I just did on one of my RAID10s
> OK, some fio results.
>
> Firstly, this is done against /tmp which is on the single standalone
> Intel SSD used for the rootfs (shows some performance level of the
> chipset I presume):
>
> root@san1:/tmp/testing# fio /root/test.fio
> seq-read: (g=0): rw=read, bs=64K-64K/64K-64K, ioengine=libaio, iodepth=32
> seq-write: (g=1): rw=write, bs=64K-64K/64K-64K, ioengine=libaio, iodepth=32
> Starting 2 processes
> seq-read: Laying out IO file(s) (1 file(s) / 4096MB)
> Jobs: 1 (f=1): [_W] [100.0% done] [0K/137M /s] [0/2133 iops] [eta 00m:00s]
> seq-read: (groupid=0, jobs=1): err= 0: pid=4932
>   read : io=4096MB, bw=518840KB/s, iops=8106, runt=  8084msec
> seq-write: (groupid=1, jobs=1): err= 0: pid=5138
>   write: io=4096MB, bw=136405KB/s, iops=2131, runt= 30749msec
> Run status group 0 (all jobs):
>    READ: io=4096MB, aggrb=518840KB/s, minb=531292KB/s, maxb=531292KB/s,
> mint=8084msec, maxt=8084msec
>
> Run status group 1 (all jobs):
>   WRITE: io=4096MB, aggrb=136404KB/s, minb=139678KB/s, maxb=139678KB/s,
> mint=30749msec, maxt=30749msec
>
> Disk stats (read/write):
>   sda: ios=66570/66363, merge=10297/10453, ticks=259152/993304,
> in_queue=1252592, util=99.34%
>
>
> PS, I'm assuming I should omit the extra output similar to what you
> did.... If I should include all info, I can re-run and provide...
>
> This seems to indicate a read speed of 531M and write of 139M, which to
> me says something is wrong. I thought write speed is slower, but not
> that much slower?
>
> Moving on, I've stopped the secondary DRBD, created a new LV (testlv) of
> 15G, and formatted with ext4, mounted it, and re-run the test:
>
> seq-read: (groupid=0, jobs=1): err= 0: pid=19578
>   read : io=4096MB, bw=640743KB/s, iops=10011, runt=  6546msec
> seq-write: (groupid=1, jobs=1): err= 0: pid=19997
>   write: io=4096MB, bw=208765KB/s, iops=3261, runt= 20091msec
> Run status group 0 (all jobs):
>    READ: io=4096MB, aggrb=640743KB/s, minb=656120KB/s, maxb=656120KB/s,
> mint=6546msec, maxt=6546msec
>
> Run status group 1 (all jobs):
>   WRITE: io=4096MB, aggrb=208765KB/s, minb=213775KB/s, maxb=213775KB/s,
> mint=20091msec, maxt=20091msec
>
> Disk stats (read/write):
>   dm-14: ios=65536/64841, merge=0/0, ticks=206920/469464,
> in_queue=676580, util=98.89%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0,
> aggrin_queue=0, aggrutil=0.00%
>     drbd2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=-nan%
>
> dm-14 is the testlv
>
> So, this indicates a max read speed of 656M and write of 213M, again,
> write is very slow (about 30%).
>
> With these figures, just 2 x 1Gbps links would saturate the write
> performance of this RAID5 array.
>
> Finally, changing the fio config file to point filename=/dev/vg0/testlv
> (ie, raw LV, no filesystem):
> seq-read: (groupid=0, jobs=1): err= 0: pid=10986
>   read : io=4096MB, bw=652607KB/s, iops=10196, runt=  6427msec
> seq-write: (groupid=1, jobs=1): err= 0: pid=11177
>   write: io=4096MB, bw=202252KB/s, iops=3160, runt= 20738msec
> Run status group 0 (all jobs):
>    READ: io=4096MB, aggrb=652606KB/s, minb=668269KB/s, maxb=668269KB/s,
> mint=6427msec, maxt=6427msec
>
> Run status group 1 (all jobs):
>   WRITE: io=4096MB, aggrb=202252KB/s, minb=207106KB/s, maxb=207106KB/s,
> mint=20738msec, maxt=20738msec
>
> Not much difference, which I didn't really expect...
>
> So, should I be concerned about these results? Do I need to try to
> re-run these tests at a lower layer (ie, remove DRBD and/or LVM from the
> picture)? Are these meaningless and I should be running a different
> test/set of tests/etc ?

OK, I've upgraded to:
Linux san1 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64
GNU/Linux
I also upgraded iscsitarget from testing, as there seemed to be a few
fixes there, even though not the one I might have liked:
ii  iscsitarget        1.4.20.2-10.1   iSCSI Enterprise Target userland tools
ii  iscsitarget-dkms   1.4.20.2-10.1   iSCSI Enterprise Target kernel module source - dkms version


Then I re-ran the fio tests from above and here is what I get when
testing against an LV which has a snapshot against it:
seq-read: (groupid=0, jobs=1): err= 0: pid=10168
  read : io=4096MB, bw=1920MB/s, iops=30724, runt=  2133msec
seq-write: (groupid=1, jobs=1): err= 0: pid=10169
  write: io=2236MB, bw=38097KB/s, iops=595, runt= 60094msec
Run status group 0 (all jobs):
   READ: io=4096MB, aggrb=1920MB/s, minb=1966MB/s, maxb=1966MB/s,
mint=2133msec, maxt=2133msec

Run status group 1 (all jobs):
  WRITE: io=2236MB, aggrb=38097KB/s, minb=39011KB/s, maxb=39011KB/s,
mint=60094msec, maxt=60094msec

So, 1920MB/s read sounds good to me, almost 3 times faster; however,
the write performance is pretty dismal :(

After removing the snapshot, here is another look:
seq-read: (groupid=0, jobs=1): err= 0: pid=10222
  read : io=4096MB, bw=2225MB/s, iops=35598, runt=  1841msec
seq-write: (groupid=1, jobs=1): err= 0: pid=10223
  write: io=4096MB, bw=111666KB/s, iops=1744, runt= 37561msec
Run status group 0 (all jobs):
   READ: io=4096MB, aggrb=2225MB/s, minb=2278MB/s, maxb=2278MB/s,
mint=1841msec, maxt=1841msec

Run status group 1 (all jobs):
  WRITE: io=4096MB, aggrb=111666KB/s, minb=114346KB/s, maxb=114346KB/s,
mint=37561msec, maxt=37561msec

A big improvement, 111MB/s write, and even better reads. However, this
write speed still seems pretty slow.

Another run after stopping the secondary DRBD sync:
seq-read: (groupid=0, jobs=1): err= 0: pid=10708
  read : io=4096MB, bw=2242MB/s, iops=35870, runt=  1827msec
seq-write: (groupid=1, jobs=1): err= 0: pid=10709
  write: io=4096MB, bw=560661KB/s, iops=8760, runt=  7481msec
Run status group 0 (all jobs):
   READ: io=4096MB, aggrb=2242MB/s, minb=2296MB/s, maxb=2296MB/s,
mint=1827msec, maxt=1827msec

Run status group 1 (all jobs):
  WRITE: io=4096MB, aggrb=560660KB/s, minb=574116KB/s, maxb=574116KB/s,
mint=7481msec, maxt=7481msec

Now THAT is what I was hoping to see.... 2,242MB/s read, enough to
saturate 18 x 1Gbps ports... and 560MB/s write, enough for 4.5 x 1Gbps,
which is more than the maximum from 2 machines. So as long as I have the
secondary DRBD disconnected during the day (I do), and don't have any
LVM snapshots (I don't due to performance), then things should be a lot
better.

Now looking back at all this, I think I was probably suffering from a
whole bunch of problems:

1) Write cache enabled on windows
2) iSCSI not configured to properly deal with intermittent/slow
responses, queue forever instead of returning an error
3) Not using multipath IO
4) Server storage performance too slow to keep up (due to kernel bug in
debian stable squeeze/2.6.32)
5) Using LVM snapshots which degraded performance
6) Using DRBD during the day with spinning disks on the secondary
(couldn't keep up, slowed down the primary)
7) Sharing a single ethernet for user traffic and SAN traffic, allowing
one protocol to flood/block the other
8) Using RR bonding with more ports on the SAN than the client, causing
flooding, 802.3X pause frames, etc

I can't say that any one of the above fixed the problem; it has been
getting progressively better as each item has been addressed. I'd like
to think that it's very close to done now.
The only thing I still need to do is get rid of the bond0 on the SAN,
change to 8 individual IPs, and configure the clients to talk to two of
the IPs on the SAN, but only one over each ethernet interface.

I'd again like to say thanks to all the people who've helped out with
this drama. I did forget to take those photos, but I'll take some next
time I'm in. I think I did a pretty good job overall, and it looks
reasonably neat (by my standards anyway :)

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-17  8:41                                                 ` Adam Goryachev
@ 2013-02-17 13:58                                                   ` Stan Hoeppner
  2013-02-17 14:46                                                     ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-17 13:58 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/17/2013 2:41 AM, Adam Goryachev wrote:
> On 17/02/13 17:28, Stan Hoeppner wrote:

> OK, in that case, you are correct, I've misunderstood you.

This is my fault.  I should have explained that better.  I left it
ambiguous.

> I'm unsure how to configure things to work that way...
> 
> I've run the following commands from the URL you posted previously:
> http://linfrastructure.blogspot.com.au/2008/02/multipath-and-equallogic-iscsi.html
> 
> iscsiadm -m iface -I iface0 --op=new
> iscsiadm -m iface -I iface1 --op=new
> iscsiadm -m iface -I iface0 --op=update -n iface.hwaddress -v
> 00:16:3E:XX:XX:XX
> iscsiadm -m iface -I iface1 --op=update -n iface.hwaddress -v
> 00:16:3E:XX:XX:XX
> 
> iscsiadm -m discovery -t st -p 10.X.X.X
> 
> The above command (discovery) finds 4 paths for each LUN, since it
> automatically uses each interface to talk to each LUN. Do you know how
> to stop that from happening? If I only allow a connection to a single IP
> on the SAN, then it will only use one session from each interface.

This is what LUN masking is for.  I haven't seen your target
configuration, whether you're just using ietd.conf for access control,
or if you're using column 4 in target defs in /etc/iscsi/targets.  So I
can't help you set up your masking at this point.  It'll be complicated
no matter what, as you are apparently currently allowing the world to
see the LUNs.

Since you're not yet familiar with masking, simply use --interface and
--portal with iscsiadm to discover and log into LUNs manually on a 1:1
port basis.  This can be easily scripted.  See the man page for details.
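
Something along these lines, reusing the portal addresses from my
example above.  The discovery step may still record the other portal,
but the portal-filtered logins keep each interface on its own path:

# iface0 only ever talks to portal .1, iface1 only to portal .2
iscsiadm -m discovery -t st -p 192.168.101.1 -I iface0
iscsiadm -m discovery -t st -p 192.168.101.2 -I iface1
# then log in per interface/portal pair
iscsiadm -m node -p 192.168.101.1 -I iface0 --login
iscsiadm -m node -p 192.168.101.2 -I iface1 --login

That's a sketch, not gospel; check it against the man page as record
handling differs a little between open-iscsi versions.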

> No, not quite. See below.
...
> The iSCSI bug is limiting the number of sessions that can be setup
> within a very short time interval. It isn't a maximum number of
> sessions. (I could verify this by disabling the automatic login, and
> manually login to each LUN one by one (4 sessions at a time)). This is
> why I can have 11 sessions from 8 machines at one time previously,
> because only one machine would login at a time (unless they all booted
> at exactly the same instant), and each one would only create 11
> sessions. Same with the current work-around/setup, only 22 sessions per
> machine, so only 22 being logged into at a time.
> See this for a perhaps better explanation of the bug (that sort of isn't
> a bug, just a default limitation):
> http://blog.wpkg.org/2007/09/09/solving-reliability-and-scalability-problems-with-iscsi/

Got it.  Getting the issue above fixed also solves this problem, at
least to a degree.

> After more reading, it seems there is still no package with this fix
> included, 1.4.20.2-10.1 doesn't include it, and that is the most recent
> version. The only solution to this would be to re-build the deb src
> package with the additional one line patch, but if I get the above
> solution (only one login from each interface) then I don't need it anyway.

Yes, you're right.  Apparently I didn't read the report thoroughly.
It's a logins per unit time issue.  Fixing the excess logins should
mitigate this pretty well.

>>> So, will see how this goes this week, then will try to upgrade the kernel, and also upgrade the iscsi target to fix both bugs and can then change back to MPIO with 4 paths (2:2).
>>>
>>> In fact, I suspect a significant part of this entire project's performance issue could be attributed to the kernel bug. The user who reported the issue was getting slower performance from the SSD compared to an old HDD, and I'm losing a significant amount of performance to it (as you said, even 1Gbps should probably be sufficient).

Yep.  Separating iSCSI traffic on the DC to another link seems to have
helped quite a bit.  But my, oh my, that 3x plus increase in SSD
throughput surely will help.  I'm still curious as to how much of that
was the LSI and how much was the kernel bug fix.

On that note I'm going to start a clean thread regarding your 3x
read/write throughput ratio deficit.

>> It seems pretty clear the SSD bug is affecting you.  However it seems
>> your iSCSI issues are unrelated to the iSCSI "bug".
> 
> Nope, pretty sure the iSCSI bug is the issue... In addition, my
> inability to work out how to tell iscsiadm to only create one session
> from each interface. Solving this usage issue would get me back on track
> and side-step the whole iSCSI bug anyway.

Again, I think you're on the money with the iscsi-target 32 limit bug,
and you should be able to whip the sessions into shape with those CLI
options.  If not, you can dig into masking, which will take a while
longer, but is the standard method for this.

> I was considering a complete upgrade to debian testing on the mistaken
> assumption that it would include:
> 1) newer kernel (it does of course)
> 2) newer iscsitarget (it does, but not new enough)
> 3) newer drbd (it doesn't, but I'm already using a self compiled version
> anyway from the upstream stable release).
> 
> So, of course, you are right. I will try a remote upgrade now to the
> backport kernel, probably need to rebuild the dkms for iscsi, and
> rebuild DRBD. None of which should impact on a remote reboot. Worst
> case, it's only a 20 minute drive. This should resolve the SSD
> performance, and leaves me with just resolving the usage of iscsiadm.
> 
> Thanks for your assistance, and patience with me, I appreciate it :)

I feel privileged to have been of continued assistance Adam. :)

You have a bit of a unique setup there, and the hardware necessary for
some extreme performance.  My heart sank when I saw the IO numbers you
posted and I felt compelled to try to help.  Very few folks have a
storage server, commercial or otherwise, with 2GB/s of read and 650MB/s
of write throughput with 1.8TB of capacity.  Allow me to continue
assisting and we'll get that write number up there with the read.

I've been designing and building servers around channel parts for over
15 years, and I prefer it any day to Dell/HP/IBM etc.  It's nice to see
other folks getting out there on the bleeding edge building ultra high
performance systems with channel gear.  We don't see systems like this
on linux-raid very often.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-17 13:58                                                   ` Stan Hoeppner
@ 2013-02-17 14:46                                                     ` Adam Goryachev
  2013-02-19  8:17                                                       ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-17 14:46 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 18/02/13 00:58, Stan Hoeppner wrote:
> On 2/17/2013 2:41 AM, Adam Goryachev wrote:
>> On 17/02/13 17:28, Stan Hoeppner wrote:
> 
>> OK, in that case, you are correct, I've misunderstood you.
> 
> This is my fault.  I should have explained that better.  I left it
> ambiguous.
> 
>> I'm unsure how to configure things to work that way...
>>
>> I've run the following commands from the URL you posted previously:
>> http://linfrastructure.blogspot.com.au/2008/02/multipath-and-equallogic-iscsi.html
>>
>> iscsiadm -m iface -I iface0 --op=new
>> iscsiadm -m iface -I iface1 --op=new
>> iscsiadm -m iface -I iface0 --op=update -n iface.hwaddress -v
>> 00:16:3E:XX:XX:XX
>> iscsiadm -m iface -I iface1 --op=update -n iface.hwaddress -v
>> 00:16:3E:XX:XX:XX
>>
>> iscsiadm -m discovery -t st -p 10.X.X.X
>>
>> The above command (discovery) finds 4 paths for each LUN, since it
>> automatically uses each interface to talk to each LUN. Do you know how
>> to stop that from happening? If I only allow a connection to a single IP
>> on the SAN, then it will only use one session from each interface.
> 
> This is what LUN masking is for.  I haven't seen your target
> configuration, whether you're just using ietd.conf for access control,
> or if you're using column 4 in target defs in /etc/iscsi/targets.  So I
> can't help you setup your masking at this point.  It'll be complicated
> no matter what, as you are apparently currently allowing the world to
> see the LUNs.

I must say, I'm only just getting to learn about this... Previously, it
was wide open... the entire user LAN had direct access to the iSCSI
without any username/password. As part of the separation of the user LAN
from the iSCSI SAN, I also added an iptables rule to block iSCSI
connections. After a bit more investigating, I found
/etc/iet/targets.allow, where I could put only the IP of the SAN
interface, which helped.
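
(For anyone following along, the line I added is roughly of this shape,
if I've read the format right; the IQN and address below are
placeholders, not my real values:

iqn.2013-02.au.com.websitemanagers:san1.lun0   192.168.101.1

i.e. a target name, or ALL, followed by the address(es)/network(s) it
should be visible to, as far as I can tell.)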

Previously, a discovery was actually finding a bunch of IPs from the
SAN, including private IP addresses that were on the directly connected
interface for DRBD. I was running a discovery and then some rm commands
to delete the extra files from /etc/iscsi before running the login
commands.

After reading the man page for ietd (just command line options) and
ietd.conf (only refers to username/password restrictions), and looking
at the file targets.allow, it doesn't seem to be too easily configured
to block access in that way.

> Since you're not yet familiar with masking, simply use --interface and
> --portal with iscsiadm to discover and log into LUNs manually on a 1:1
> port basis.  This can be easily scripted.  See the man page for details.

I'll start with this method... Haven't looked at the iscsiadm man page
again yet, but I suspect it shouldn't be too hard to work out. I'm also
thinking I could just run the discovery and manually delete the
extraneous files the same as I was doing previously. I'll sort this out
next week.

> Yep.  Separating iSCSI traffic on the DC to another link seems to have
> helped quite a bit.  But my, oh my, that 3x plus increase in SSD
> throughput surely will help.  I'm still curious as to how much of that
> was the LSI and how much was the kernel bug fix.

Well, hard to say, but here is the fio test result from the OS drive
before the kernel change:
   READ: io=4096MB, aggrb=518840KB/s, minb=531292KB/s, maxb=531292KB/s,
mint=8084msec, maxt=8084msec
  WRITE: io=4096MB, aggrb=136404KB/s, minb=139678KB/s, maxb=139678KB/s,
mint=30749msec, maxt=30749msec

Disk stats (read/write):
  sda: ios=66570/66363, merge=10297/10453, ticks=259152/993304,
in_queue=1252592, util=99.34%

Here is the same test with the new kernel (note: this SSD is still
connected to the motherboard; I wasn't confident the HBA drivers were
included in my kernel when I installed it, etc.):

   READ: io=4096MB, aggrb=516349KB/s, minb=528741KB/s, maxb=528741KB/s,
mint=8123msec, maxt=8123msec
  WRITE: io=4096MB, aggrb=143812KB/s, minb=147264KB/s, maxb=147264KB/s,
mint=29165msec, maxt=29165msec

Disk stats (read/write):
  sdf: ios=66509/66102, merge=10342/10537, ticks=260504/937872,
in_queue=1198440, util=99.14%

Interesting that there is very little difference.... I'm not sure why...

It would be interesting to re-test the onboard SATA performance, but I
assure you I really don't want to pull that machine apart again. (Some
insane person mounted it on the rack mount rails upside down!!! So it's
a real PITA for something that is supposed to make life easier!)

> On that note I'm going to start a clean thread regarding your 3x
> read/write throughput ratio deficit.

Good idea :)

> You have a bit of a unique setup there, and the hardware necessary for
> some extreme performance.  My heart sank when I saw the IO numbers you
> posted and I felt compelled to try to help.  Very few folks have a
> storage server, commercial or otherwise, with 2GB/s of read and 650MB/s
> of write throughput with 1.8TB of capacity.  Allow me to continue
> assisting and we'll get that write number up there with the read.

Well, it wasn't meant to be such a beast of a machine. It was originally
specced with 12 x 1TB 7200rpm drives, using an Overland SAN (because I
didn't want to bet my reputation on being able to build up a Linux-based
home-grown solution for them). When that fell over a number of times,
the tech support couldn't resolve the issues, and finally it lost all of
the data from one LUN (thank goodness for backups), we sent it back for
a refund. I figured I might as well try to put something together, and
extended it to a dual server setup for just a little extra budget,
except I used 4 x 2TB 7200rpm drives.... I didn't really consider the
issue of concurrent access to different parts of the disk. When I looked
at it, these SSDs were about $600 each, and the 2TB drives were about
$500 each. So, the options were 8 x 2TB drives in RAID10 (8TB space) or
5 x SSDs in RAID5 (2TB space). The 2TB capacity was ample, and I
preferred the SSDs since an SSD is designed for random access, while the
RAID10 option just increased the number of spindles, which might not
have been enough.

So, it has been through some hoops, and has taken some effort, but at
the end of the day, I think we have a much better solution than buying
any off-the-shelf SAN device, and we most definitely get a lot more
flexibility. Eventually the plan is to add a 3rd DRBD node at a remote
office for DR purposes.

> I've been designing and building servers around channel parts for over
> 15 years, and I prefer it any day to Dell/HP/IBM etc.  It's nice to see
> other folks getting out there on the bleeding edge building ultra high
> performance systems with channel gear.  We don't see systems like this
> on linux-raid very often.

I prefer the "channel parts systems" as well, though I was always a bit
afraid to build them for customers just in case it went wrong... I
always build up my own stuff though. Of course, next time I need to do
something like this, I'll have a heck of a lot more knowledge and
confidence to do it.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-17  9:52             ` RAID performance - new kernel results Adam Goryachev
@ 2013-02-18 13:20               ` Stan Hoeppner
  2013-02-20 17:10                 ` Adam Goryachev
  2013-02-23 15:57               ` RAID performance - new kernel results John Stoffel
  1 sibling, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-18 13:20 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/17/2013 3:52 AM, Adam Goryachev wrote:

>    READ: io=4096MB, aggrb=2242MB/s, minb=2296MB/s, maxb=2296MB/s,
> mint=1827msec, maxt=1827msec

>   WRITE: io=4096MB, aggrb=560660KB/s, minb=574116KB/s, maxb=574116KB/s,
> mint=7481msec, maxt=7481msec

Our read throughput is almost exactly 4x the write throughput.  At the
hardware level, single SSD write throughput should only be ~10% lower
than read.  Sequential writes w/RAID5 should not cause RMW cycles so
that is not in play in these tests.  So why are writes so much slower?
Knowing these things, where should we start looking for our performance
killing needle in this haystack?

We know that the md/RAID5 driver still uses a single write thread in
kernel 3.2.35.  And given we're pushing over 500MB/s through md/RAID5 to
SSD storage, it's possible that this thread is eating all of one CPU
core with both IOs and parity calculations, limiting write throughput.
So that's the first place to look.  For your 7 second test run of FIO,
we could do some crude instrumenting.  Assuming you have top setup to
show individual Cpus (if not hit '1' in interactive mode to get them,
then exit), we can grab top output twice a second for 10 seconds, in
another terminal window.  So we do something like the following, giving
3 seconds to switch windows and launch FIO.  (Or one could do it in  a
single window, writing a script to pipe the output of each to a file)

~$ top -b -n 20 -d 0.5 |grep Cpu

yields 28 lines of this for 2 cores, 56 lines for 4 cores.

Cpu0 : 1.2%us, 0.5%sy, 1.8%ni, 96.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 1.1%us, 0.5%sy, 2.2%ni, 96.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu0 : 1.9%us, 1.9%sy, 0.0%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

This will give us a good idea of what the cores are doing during the FIO
run, as well as interrupt distribution, which CPUs are handling the
lower level IO threads, how long we're waiting on the SSDs, etc.  If any
core is at 98%+ during the run then md thread starvation is the problem.

(If you have hyperthreading enabled, reboot and disable it.  It normally
decreases thread performance due to scheduling and context switching
overhead, among other things.  Not to mention it makes determining
actual CPU load more difficult.  In this exercise you'll needlessly have
twice as many lines of output to comb through.)
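
If you'd rather capture it all in one go, a rough script along these
lines will do (the job file path is the one you used earlier; the log
locations are just examples):

#!/bin/sh
# sample per-CPU usage in the background while the fio job runs
top -b -n 20 -d 0.5 | grep Cpu > /tmp/cpu-during-fio.log &
fio /root/test.fio > /tmp/fio-result.log 2>&1
wait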

If md is maxing out a single core, the next step is to optimize the single
thread performance.  There's not much you can do here but to optimize
the parity calculation rate and tweak buffering.  I'm no expert on this
but others here are.  IIRC you can tweak md to use the floating point
registers and SSEx/AVX instructions.  These FP execution units in the
CPU run in parallel to the integer units, and are 128 vs 64 bits wide
(256 for AVX).  So not only is the number crunching speed increased, but
it's done in parallel to the other instructions.  This makes the integer
units more available.  You should also increase your stripe_cache_size
if you haven't already.  Such optimizations won't help much overall--
we're talking 5-20% maybe-- because the bottleneck lies elsewhere in the
code.  Which brings us to...

The only other way I know of to increase single thread RAID5 write
performance substantially is to grab a very recent kernel and Shaohua
Li's patch set developed specifically for the single write thread
problem on RAID1/10/5/6.  His test numbers show improvements of 130-200%
increasing with drive count, but not linearly.  It is described here:
http://lwn.net/Articles/500200/

With current distro kernels and lots of SSDs, the only way to
significantly improve this single thread write performance is to use
nested md/RAID0 over smaller arrays to increase the thread count and
bring more cores into play.  With this you get one write thread per
constituent array.  Each thread receives one core of performance.  The
stripe over them has no threads and can scale to any number of cores.

Assuming you are currently write thread bound at ~560-600MB/s, adding
one more Intel SSD for 6 total gives us...

RAID0 over 3 RAID1, 3 threads-- should yield read speed between 1.5 and
3GB/s depending on load, and increase your write speed to 1.6GB/s, for
the loss of 480G capacity.

RAID0 over 2 RAID5, 2 threads-- should yield between 2.2 and 2.6GB/s
read speed, and increase your write speed to ~1.1GB/s, for no change in
capacity.

Again, these numbers assume the low write performance is due to thread
starvation.

The downside for both:  Neither of these configurations can be expanded
with a reshape and thus drives cannot be added.  That can be achieved by
using a linear layer atop these RAID0 devices, and adding new md devices
to the linear array later.  With this you don't get automatic even
distribution of IO for the linear array, but only for the constituent
striped arrays.  This isn't a bad tradeoff when IO flow analysis and
architectural planning are performed before a system is deployed.
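
To make the layering concrete, the RAID0 over two RAID5 variant would
be built something like this (device names are placeholders only, and
this of course means recreating the array and restoring the data):

mdadm --create /dev/md10 --level=5 --raid-devices=3 /dev/sdX1 /dev/sdY1 /dev/sdZ1
mdadm --create /dev/md11 --level=5 --raid-devices=3 /dev/sdU1 /dev/sdV1 /dev/sdW1
mdadm --create /dev/md20 --level=0 --chunk=64 --raid-devices=2 /dev/md10 /dev/md11

DRBD/LVM then sit on top of /dev/md20 instead of the current single
array.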

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-17 14:46                                                     ` Adam Goryachev
@ 2013-02-19  8:17                                                       ` Stan Hoeppner
  2013-02-20 16:45                                                         ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-19  8:17 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/17/2013 8:46 AM, Adam Goryachev wrote:

> I'll start with this method... Haven't looked at the iscsiadm man page
> again yet, but I suspect it shouldn't be too hard to work out. I'm also
> thinking I could just run the discovery and manually delete the
> extraneous files the same as I was doing previously. I'll sort this out
> next week.

I strongly suggest you read/research and plan beforehand.  If you have
not setup your SAN subnetting and Xen IP SAN ethernet port assignments
correctly, you will not be able to use LUN masking, as it is based on
subnet masks, to show/hide LUNs to initiators.  Which means you'll have
to rip out and redo your IP assignments on the fly.

Building a SAN such as this isn't something that can be done properly
while flying by the seat of your pants.  It takes planning.

> Well, hard to say, but here is the fio test result from the OS drive
> before the kernel change:
>    READ: io=4096MB, aggrb=518840KB/s, minb=531292KB/s, maxb=531292KB/s,
> mint=8084msec, maxt=8084msec
>   WRITE: io=4096MB, aggrb=136404KB/s, minb=139678KB/s, maxb=139678KB/s,
> mint=30749msec, maxt=30749msec

> Disk stats (read/write):
>   sda: ios=66570/66363, merge=10297/10453, ticks=259152/993304,
> in_queue=1252592, util=99.34%

This says /dev/sda

> Here is the same test with the new kernel (note, this SSD is still
> connected to the motherboard, I wasn't confident if the HBA drivers were
> included in my kernel, when I installed it, etc.
> 
>    READ: io=4096MB, aggrb=516349KB/s, minb=528741KB/s, maxb=528741KB/s,
> mint=8123msec, maxt=8123msec
>   WRITE: io=4096MB, aggrb=143812KB/s, minb=147264KB/s, maxb=147264KB/s,
> mint=29165msec, maxt=29165msec
> 
> Disk stats (read/write):
>   sdf: ios=66509/66102, merge=10342/10537, ticks=260504/937872,
> in_queue=1198440, util=99.14%

this says /dev/sdf

> Interesting that there is very little difference.... I'm not sure why...

Is this the same SSD?  Could be test parameters, controller, etc.  SSDs
seem to be a little finicky WRT write queue depth.  Most seem to give
lower seq write performance with a QD of 1 and level off at peak
performance around QD of 3 to 4.  The IO request size plays a role as
well.  Paste your FIO command line as well as the model of this OS SSD.
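
If you want to see the queue depth effect directly, a quick sweep like
this shows where the drive levels off (the parameters are only an
example, reusing your /tmp/testing/test file):

for d in 1 2 4 8 16 32; do
    fio --name=qd$d --filename=/tmp/testing/test --rw=write --bs=64k \
        --size=1g --direct=1 --ioengine=libaio --iodepth=$d
done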

> It would be interesting to re-test the onboard SATA performance, but I
> assure you I really don't want to pull that machine apart again. (Some
> insane person mounted it on the rack mount rails upside down!!! So it's
> a real pita for something that is supposed to make life easier!

WTF? How did you accomplish the upgrades?  Why didn't you flip it over
at that time?  Wow....

> So, it has been through some hoops, and has taken some effort, but at

Put it through another hoop and get it mounted upright.  I still can't
believe this...  you must be pulling our collective leg.

> the end of the day, I think we have a much better solution than buying
> any off the shelf SAN device, and most definitely get a lot more
> flexibility. 

Definitely cheaper, and more flexible should you need to run a filer
(Samba) directly on the box.  Not NEARLY as easy to set up.  Nexsan has
some nice gear that's a breeze to configure, with a nice intuitive web GUI.

> Eventually the plan is to add a 3rd DRBD node at a remote
> office for DR purposes.

IIRC, DRBD isn't recommended for remote site use with public networks
due to reliability.  Will you have a GbE metro ethernet connection, or two?

>> I've been designing and building servers around channel parts for over
>> 15 years, and I prefer it any day to Dell/HP/IBM etc.  It's nice to see
>> other folks getting out there on the bleeding edge building ultra high
>> performance systems with channel gear.  We don't see systems like this
>> on linux-raid very often.
> 
> I prefer the "channel parts systems" as well, though I was always a bit
> afraid to build them for customers just in case it went wrong... I

'Whitebox' or 'custom' if you prefer.  Selecting good quality components
with solid manufacturer warranty and technical support, and performing
extensive burn, in is the key to success.  I've had good luck with
SuperMicro mainboards and chassis/backplanes.  Intel server boards are
quality as well, but for many years I've been exclusively AMD for CPUs,
for many reasons.  I do prefer Intel's NICs.

> always build up my own stuff though. Of course, next time I need to do
> something like this, I'll have a heck of a lot more knowledge and
> confidence to do it.

Unless you're always learning/doing new stuff, IT gets boring.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-19  8:17                                                       ` Stan Hoeppner
@ 2013-02-20 16:45                                                         ` Adam Goryachev
  2013-02-21  0:45                                                           ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-20 16:45 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Dave Cundiff, linux-raid

Stan Hoeppner <stan@hardwarefreak.com> wrote:

>On 2/17/2013 8:46 AM, Adam Goryachev wrote:
>
>> I'll start with this method... Haven't looked at the iscsiadm man
>page
>> again yet, but I suspect it shouldn't be too hard to work out. I'm
>also
>> thinking I could just run the discover and manually delete the
>> extraneous files the same as I was doing previously. I'll sort this
>out
>> next week.
>
>I strongly suggest you read/research and plan beforehand.  If you have
>not setup your SAN subnetting and Xen IP SAN ethernet port assignments
>correctly, you will not be able to use LUN masking, as it is based on
>subnet masks, to show/hide LUNs to initiators.  Which means you'll have
>to rip out and redo your IP assignments on the fly.
>
>Building a SAN such as this isn't something that can be done properly
>while flying by the seat of your pants.  It takes planning.

I'll work this one out; I don't think it is really relevant to the
issues anymore anyway. I can get better than 1Gbps, and I don't expect
that to change much with this change.

>> Well, hard to say, but here is the fio test result from the OS drive
>> before the kernel change:
>>    READ: io=4096MB, aggrb=518840KB/s, minb=531292KB/s,
>maxb=531292KB/s,
>> mint=8084msec, maxt=8084msec
>>   WRITE: io=4096MB, aggrb=136404KB/s, minb=139678KB/s,
>maxb=139678KB/s,
>> mint=30749msec, maxt=30749msec
>
>> Disk stats (read/write):
>>   sda: ios=66570/66363, merge=10297/10453, ticks=259152/993304,
>> in_queue=1252592, util=99.34%
>
>This says /dev/sda
>
>> Here is the same test with the new kernel (note, this SSD is still
>> connected to the motherboard, I wasn't confident if the HBA drivers
>were
>> included in my kernel, when I installed it, etc.
>> 
>>    READ: io=4096MB, aggrb=516349KB/s, minb=528741KB/s,
>maxb=528741KB/s,
>> mint=8123msec, maxt=8123msec
>>   WRITE: io=4096MB, aggrb=143812KB/s, minb=147264KB/s,
>maxb=147264KB/s,
>> mint=29165msec, maxt=29165msec
>> 
>> Disk stats (read/write):
>>   sdf: ios=66509/66102, merge=10342/10537, ticks=260504/937872,
>> in_queue=1198440, util=99.14%
>
>this says /dev/sdf

Remember, sdb-sdf (the RAID5) were connected to the onboard ports, and
are now connected to the LSI. Linux sees the LSI card before the onboard
drive (I assume), so it has now named them sda-sde, and the onboard boot
SSD is now sdf... That messed with me when I was looking at some of the
stats/graphs until I realised it too....

>> Interesting that there is very little difference.... I'm not sure
>why...
>
>Is this the same SSD?  Could be test parameters, controller, etc.  SSDs
>seem to be a little finicky WRT write queue depth.  Most seem to give
>lower seq write performance with a QD of 1 and level off at peak
>performance around QD of 3 to 4.  The IO request size plays a role as
>well.  Paste your FIO command line as well as the model of this OS SSD.

Same SSD in both tests. The fio command line was just: fio test.fio
The fio job file was the one posted in this thread by another user, as follows:
[global]
bs=64k
ioengine=libaio
iodepth=32
size=4g
direct=1
runtime=60
#directory=/dev/vg0/testlv
filename=/tmp/testing/test

[seq-read]
rw=read
stonewall

[seq-write]
rw=write
stonewall

Note, the "root ssd" is the /tmp/testing/test file, when testing MD
performance on the RAID5 I'm using the /dev/vg0/testlv which is an LV on
the DRBD on the RAID5 (md2), and I do the test with the DRBD disconnected.

>> It would be interesting to re-test the onboard SATA performance, but
>I
>> assure you I really don't want to pull that machine apart again.
>(Some
>> insane person mounted it on the rack mount rails upside down!!! So
>it's
>> a real pita for something that is supposed to make life easier!
>
>WTF? How did you accomplish the upgrades?  Why didn't you flip it over
>at that time?  Wow....

A VERY good question, both of them.... I worked like a mechanic, from
underneath... a real pain I would say. I didn't like the idea of trying
to flip it by myself though, much better with someone else to help in
the process.... I think my supplier gives me crappy rails/solutions,
because they are always a pain to get installed....

>> So, it has been through some hoops, and has taken some effort, but at
>
>Put it through another hoop and get it mounted upright.  I still can't
>believe this...  you must be pulling our collective leg.

Nope... I'll try and get to it, but it will be a while before I can take
it offline and have someone who can help fix it up... Maybe I'll call a
friend on the weekend ...

>> the end of the day, I think we have a much better solution than
>buying
>> any off the shelf SAN device, and most definitely get a lot more
>> flexibility. 
>
>Definitely cheaper, and more flexible should you need to run a filer
>(Samba) directly on the box.  Not NEARLY as easy to setup.  Nexsan has
>some nice gear that's a breeze to configure, nice intuitive web GUI.

The breeze to configure part would be nice :)

>> Eventually the plan is to add a 3rd DRBD node at a remote
>> office for DR purposes.
>
>IIRC, DRBD isn't recommended for remote site use with public networks
>due to reliability.  Will you have a GbE metro ethernet connection, or
>two?

We have a 10Mbps private connection. I think we can license the DRBD
proxy, which should handle the sync over a slower network; the main
issue with DRBD is when you are not using the proxy.... The connection
itself is very reliable though; it's just a matter of the bandwidth and
whether it will be sufficient. I'll test beforehand by using either the
switch or Linux to configure a slower connection (maybe 7Mbps or
something), and see if it will work reasonably.
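
Probably I'll just use tc on the replication interface for that test;
something like the sketch below, where eth1 stands in for whichever
port carries the DRBD traffic (and delete the qdisc again afterwards):

tc qdisc add dev eth1 root tbf rate 7mbit burst 32kbit latency 400ms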

>>> I've been designing and building servers around channel parts for
>over
>>> 15 years, and I prefer it any day to Dell/HP/IBM etc.  It's nice to
>see
>>> other folks getting out there on the bleeding edge building ultra
>high
>>> performance systems with channel gear.  We don't see systems like
>this
>>> on linux-raid very often.
>> 
>> I prefer the "channel parts systems" as well, though I was always a
>bit
>> afraid to build them for customers just in case it went wrong... I
>
>'Whitebox' or 'custom' if you prefer.  Selecting good quality
>components
>with solid manufacturer warranty and technical support, and performing
>extensive burn, in is the key to success.  I've had good luck with
>SuperMicro mainboards and chassis/backplanes.  Intel server boards are
>quality as well, but for many years I've been exclusively AMD for CPUs,
>for many reasons.  I do prefer Intel's NICs.

I've preferred AMD for years, but my supplier always prefers Intel, and
for systems like this they get much better warranty support for Intel
compared to almost any other brand, so I generally end up with Intel
boards and CPUs for "important" servers... I've always heard good things
about Intel NICs ...

>> always build up my own stuff though. Of course, next time I need to
>do
>> something like this, I'll have a heck of a lot more knowledge and
>> confidence to do it.
>
>Unless you're always learning/doing new stuff IT gets boring.

Very true.

Regards,
Adam

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-18 13:20               ` RAID performance - new kernel results - 5x SSD RAID5 Stan Hoeppner
@ 2013-02-20 17:10                 ` Adam Goryachev
  2013-02-21  6:04                   ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-20 17:10 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 19/02/13 00:20, Stan Hoeppner wrote:
> On 2/17/2013 3:52 AM, Adam Goryachev wrote:
> 
>>    READ: io=4096MB, aggrb=2242MB/s, minb=2296MB/s, maxb=2296MB/s,
>> mint=1827msec, maxt=1827msec
> 
>>   WRITE: io=4096MB, aggrb=560660KB/s, minb=574116KB/s, maxb=574116KB/s,
>> mint=7481msec, maxt=7481msec
> 
> Our read throughput is almost exactly 4x the write throughput.  At the
> hardware level, single SSD write throughput should only be ~10% lower
> than read.  Sequential writes w/RAID5 should not cause RMW cycles so
> that is not in play in these tests.  So why are writes so much slower?
> Knowing these things, where should we start looking for our performance
> killing needle in this haystack?
> 
> We know that the md/RAID5 driver still uses a single write thread in
> kernel 3.2.35.  And given we're pushing over 500MB/s through md/RAID5 to
> SSD storage, it's possible that this thread is eating all of one CPU
> core with both IOs and parity calculations, limiting write throughput.
> So that's the first place to look.  For your 7 second test run of FIO,
> we could do some crude instrumenting.  Assuming you have top setup to
> show individual Cpus (if not hit '1' in interactive mode to get them,
> then exit), we can grab top output twice a second for 10 seconds, in
> another terminal window.  So we do something like the following, giving
> 3 seconds to switch windows and launch FIO.  (Or one could do it in  a
> single window, writing a script to pipe the output of each to a file)
> 
> ~$ top -b -n 20 -d 0.5 |grep Cpu
> 
> yields 28 lines of this for 2 cores, 56 lines for 4 cores.
> 
> Cpu0 : 1.2%us, 0.5%sy, 1.8%ni, 96.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 1.1%us, 0.5%sy, 2.2%ni, 96.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu0 : 1.9%us, 1.9%sy, 0.0%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> 
> This will give us a good idea of what the cores are doing during the FIO
> run, as well as interrupt distribution, which CPUs are handling the
> lower level IO threads, how long we're waiting on the SSDs, etc.  If any
> core is at 98%+ during the run then md thread starvation is the problem.

That didn't quite work; I had to run the top command like this:
top -n20 -d 0.5 | grep Cpu
and then press 1 after it started; it didn't save the state from running
it interactively and then exiting.
Output is as follows:
Cpu0  :  0.1%us,  2.3%sy,  0.0%ni, 94.3%id,  3.0%wa,  0.0%hi,  0.3%si,
0.0%st
Cpu1  :  0.1%us,  0.5%sy,  0.0%ni, 98.5%id,  0.9%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.1%us,  0.2%sy,  0.0%ni, 99.3%id,  0.4%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.8%id,  1.9%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.1%us,  0.1%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.1%us,  0.3%sy,  0.0%ni, 98.0%id,  1.5%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.1%us,  0.1%sy,  0.0%ni, 97.4%id,  2.4%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.1%sy,  0.0%ni, 99.6%id,  0.2%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 47.9%sy,  0.0%ni, 50.0%id,  0.0%wa,  0.0%hi,  2.1%si,
0.0%st
Cpu1  :  0.0%us,  2.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  2.0%us, 35.3%sy,  0.0%ni,  0.0%id, 62.7%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  3.8%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 37.3%sy,  0.0%ni, 62.7%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us, 13.7%sy,  0.0%ni, 52.9%id, 33.3%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  2.0%us, 12.0%sy,  0.0%ni, 46.0%id, 40.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 26.0%sy,  0.0%ni, 44.0%id, 26.0%wa,  0.0%hi,  4.0%si,
0.0%st
Cpu1  :  0.0%us,  7.7%sy,  0.0%ni, 82.7%id,  9.6%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  4.0%sy,  0.0%ni, 86.0%id, 10.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  1.9%us,  1.9%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  3.9%sy,  0.0%ni, 13.7%id, 82.4%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 10.2%sy,  0.0%ni, 51.0%id, 38.8%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  0.0%us,  2.0%sy,  0.0%ni, 86.0%id, 12.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.7%sy,  0.0%ni, 66.2%id, 33.1%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 15.7%sy,  0.0%ni, 39.2%id, 41.2%wa,  0.0%hi,  3.9%si,
0.0%st
Cpu1  :  0.0%us,  4.0%sy,  0.0%ni, 82.0%id, 14.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  1.9%us,  1.9%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.7%sy,  0.0%ni, 66.7%id, 32.7%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 12.2%sy,  0.0%ni, 55.1%id, 32.7%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  0.0%us,  6.0%sy,  0.0%ni, 56.0%id, 38.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  1.9%sy,  0.0%ni, 98.1%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  1.9%sy,  0.0%ni, 98.1%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.7%sy,  0.0%ni, 66.2%id, 33.1%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 12.5%sy,  0.0%ni, 41.7%id, 45.8%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  0.0%us,  6.2%sy,  0.0%ni, 89.6%id,  4.2%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 62.0%id, 38.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  2.0%sy,  0.0%ni, 78.4%id, 19.6%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  3.8%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  1.9%sy,  0.0%ni, 57.7%id, 40.4%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 13.7%sy,  0.0%ni, 33.3%id, 51.0%wa,  0.0%hi,  2.0%si,
0.0%st
Cpu1  :  0.0%us,  7.8%sy,  0.0%ni, 80.4%id, 11.8%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 66.7%id, 33.3%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 10.4%sy,  0.0%ni, 25.0%id, 62.5%wa,  0.0%hi,  2.1%si,
0.0%st
Cpu1  :  0.0%us,  8.0%sy,  0.0%ni, 88.0%id,  4.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.7%sy,  0.0%ni, 66.7%id, 32.7%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us,  6.5%sy,  0.0%ni, 21.7%id, 71.7%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  0.0%us,  7.8%sy,  0.0%ni, 88.2%id,  3.9%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 66.7%id, 33.3%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  1.9%us,  1.9%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 14.0%sy,  0.0%ni, 34.0%id, 52.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  0.0%us,  9.8%sy,  0.0%ni, 80.4%id,  9.8%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  1.3%sy,  0.0%ni, 65.8%id, 32.9%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  0.0%us,  2.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st

Cpu0  :  0.0%us, 12.5%sy,  0.0%ni, 29.2%id, 58.3%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  0.0%us,  6.0%sy,  0.0%ni, 86.0%id,  8.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu3  :  0.0%us,  0.7%sy,  0.0%ni, 66.2%id, 33.1%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  :  2.0%us,  0.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st

There was more, very similar figures... apart from the second sample
above, there was never a single Cpu with close to 0% Idle, and I'm
assuming the %CPU in wa state is basically "idle" waiting for the disk
or something else to happen rather than the CPU actually being busy...

> (If you have hyperthreading enabled, reboot and disable it.  It normally
> decreases thread performance due to scheduling and context switching
> overhead, among other things.  Not to mention it makes determining
> actual CPU load more difficult.  In this exercise you'll needlessly have
> twice as many lines of output to comb through.)

I'll have to go in after hours to do that. Hopefully over the weekend
(BIOS setting and no remote KVM)... Can re-supply the results after that
if you think it will make a difference.

> If md is peaking a single core, the next step is to optimize the single
> thread performance.  There's not much you can do here but to optimize
> the parity calculation rate and tweak buffering.  I'm no expert on this
> but others here are.  IIRC you can tweak md to use the floating point
> registers and SSEx/AVX instructions.  These FP execution units in the
> CPU run in parallel to the integer units, and are 128 vs 64 bits wide
> (256 for AVX).  So not only is the number crunching speed increased, but
> it's done in parallel to the other instructions.  This makes the integer
> units more available.  You should also increase your stripe_cache_size
> if you haven't already.  Such optimizations won't help much overall--
> we're talking 5-20% maybe-- because the bottleneck lay elsewhere in the
> code.  Which brings us to...
> 
> The only other way I know of to increase single thread RAID5 write
> performance substantially is to grab a very recent kernel and Shaohua
> Li's patch set developed specifically for the single write thread
> problem on RAID1/10/5/6.  His test numbers show improvements of 130-200%
> increasing with drive count, but not linearly.  It is described here:
> http://lwn.net/Articles/500200/
> 
> With current distro kernels and lots of SSDs, the only way to
> significantly improve this single thread write performance is to use
> nested md/RAID0 over smaller arrays to increase the thread count and
> bring more cores into play.  With this you get one write thread per
> constituent array.  Each thread receives one core of performance.  The
> stripe over them has no threads and can scale to any numbers of cores.
> 
> Assuming you are currently write thread bound at ~560-600MB/s, adding
> one more Intel SSD for 6 total gives us...
> 
> RAID0 over 3 RAID1, 3 threads-- should yield read speed between 1.5 and
> 3GB/s depending on load, and increase your write speed to 1.6GB/s, for
> the loss of 480G capacity.
> 
> RAID0 over 2 RAID5, 2 threads-- should yield between 2.2 and 2.6GB/s
> read speed, and increase your write speed to ~1.1GB/s, for no change in
> capacity.
> 
> Again, these numbers assume the low write performance is due to thread
> starvation.

I don't think it is from my measurements...
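
For reference, on the stripe_cache_size suggestion quoted above, the usual
way to check and bump it is something like the following -- a sketch only,
assuming the RAID5 is md2 as mentioned elsewhere in the thread, and 4096 is
just an example value rather than a recommendation from this thread:

cat /sys/block/md2/md/stripe_cache_size
echo 4096 > /sys/block/md2/md/stripe_cache_size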

> The downside for both:  Neither of these configurations can be expanded
> with a reshape and thus drives cannot be added.  That can be achieved by
> using a linear layer atop these RAID0 devices, and adding new md devices
> to the linear array later.  With this you don't get automatic even
> distribution of IO for the linear array, but only for the constituent
> striped arrays.  This isn't a bad tradeoff when IO flow analysis and
> architectural planning are performed before a system is deployed.

I'll disable the hyperthreading, and re-test afterwards, but I'm not
sure that will produce much of a result. Let me know if you think I
should run any other tests to track it down...

One thing I can see is a large number of interrupts and context switches
which look like they happened at the same time as a backup run. Perhaps I
am getting too many interrupts on the network cards or the SATA controller?

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-20 16:45                                                         ` Adam Goryachev
@ 2013-02-21  0:45                                                           ` Stan Hoeppner
  2013-02-21  3:10                                                             ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-21  0:45 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/20/2013 10:45 AM, Adam Goryachev wrote:
> Stan Hoeppner <stan@hardwarefreak.com> wrote:

> Same ssd in both tests. fio command line was just fio test.fio
> The fio file was the one posted in this thread by another user as follows:
> [global]
> bs=64k
> ioengine=libaio
> iodepth=32

Try dropping the iodepth to 4 and see what that does.

> size=4g
> direct=1
> runtime=60
> #directory=/dev/vg0/testlv
> filename=/tmp/testing/test
> 
> [seq-read]
> rw=read
> stonewall
> 
> [seq-write]
> rw=write
> stonewall
> 
> Note, the "root ssd" is the /tmp/testing/test file, when testing MD
> performance on the RAID5 I'm using the /dev/vg0/testlv which is an LV on
> the DRBD on the RAID5 (md2), and I do the test with the DRBD disconnected.

Yes, and FIO performance to a file is going to be limited by the
filesystem, and specifically the O_DIRECT implementation in that FS.
You may see significantly different results from EXT2/3/4, Reiser, than
from XFS or JFS, and from different kernel and libaio versions as well.
 There are too many layers between FIO and the block device, so it's
difficult to get truly accurate performance data for the underlying device.

And in reality, all of your FIO testing should be random, not
sequential, as your workload is completely random--this is a block IO
server after all with 8 client hosts and I assume hundreds of users.

The proper way to test the capability of your iSCSI target server is to
fire up 8 concurrent FIO tests, one on each Xen box (or VM), each
running 8 threads and using random read/write IO, with each hitting a
different test file residing on a different LUN, while using standard OS
buffered IO.  Run a timed duration test of say 15 seconds.

Testing raw sequential throughput of a device (single SSD or single LUN
atop a single LV on a big mdRAID device) is not informative at all.
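
For what it's worth, a random IO job file along the lines of the sequential
one you posted might look something like this -- a sketch only, the block
size, read/write mix and runtime below are illustrative rather than a
recommendation:

[global]
bs=4k
ioengine=libaio
iodepth=32
size=4g
runtime=15
time_based
filename=/tmp/testing/test

[rand-read]
rw=randread
stonewall

[rand-write]
rw=randwrite
stonewall

[rand-mixed]
rw=randrw
rwmixread=70
stonewall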

>> WTF? How did you accomplish the upgrades?  Why didn't you flip it over
>> at that time?  Wow....
> 
> A VERY good question, both of them.... I worked like a mechanic, from
> underneath... a real pain I would say. I didn't like the idea of trying
> to flip it by myself though, much better with someone else to help in
> the process.... I think my supplier gives me crappy rails/solutions,
> because they are always a pain to get them installed....

I dog ear 2KVA UPSes in the top of 42U racks solo and have never had a
problem mounting any slide rail server.  I often forget that most IT
folks aren't also 6'4" 190lbs, and can't do a lot of racking work solo.

>> Definitely cheaper, and more flexible should you need to run a filer
>> (Samba) directly on the box.  Not NEARLY as easy to setup.  Nexsan has
>> some nice gear that's a breeze to configure, nice intuitive web GUI.
> 
> The breeze to configure part would be nice :)

When you're paying that much premium it better come with some good value
added features.

> We have 10Mbps private connection. I think we can license the DRBD proxy
> which should handle the sync over a slower network. The main issue with
> DRBD is when you are not using the DRBD proxy.... The connection itself
> is very reliable though, 

10Mbps isn't feasible for 2nd site block level replication, with DRBD
proxy or otherwise.  It's probably not even feasible for remote file
based backup.

BTW, what is the business case driver here for off site replication, and
what is the distance to the replication site?  What is the threat
profile to the primary infrastructure?  Earthquake?  Tsunami?  Flash
flooding?  Tornado?  Fire?

I've worked for and done work for businesses of all sizes, the largest
being Mastercard.  Not a one had offsite replication, and few did
network based offsite backup.  Those doing offsite other than network
based performed tape rotation to vault services.

That said, I'm in the US midwest.  And many companies on the East and
West coasts do replication to facilities here.  Off site
replication/backup only makes sense when the 2nd facility is immune to
all natural disasters, and hardened against all types of man made ones.
 If you replicate to a site in the same city or region with the same
threat profile, you're pissing in the wind.

Off site replication/backup exists for a singular purpose:

To protect against catastrophic loss of the primary facility and the
data contained therein.

Here in the midwest, datacenters are typically built in building
basements or annexes and are fireproofed, as fire is the only facility
threat.  Fireproofing is much more cost effective than the myriad things
required for site replication and rebuilding a primary site after loss
due to fire.

> just a matter of the bandwidth and whether it
> will be sufficient. I'll test beforehand by using either the switch or
> linux to configure a slower connection (maybe 7M or something), and see
> if it will work reasonably.

I highly recommend you work through the business case for off site
replication/DR before embarking down this path.

> I've preferred AMD for years, but my supplier always prefers Intel, and

Of course, they're an IPD - Intel Product Dealer.  They get kickbacks
and freebies from the IPD program, including free product samples,
advance prototype products, promotional materials, and, the big one, 50%
of the cost of print, radio, and television ads that include Intel
product and the logo/jingle.  You also get tiered pricing depending on
volume, no matter how small your shop is.  And of course Intel holds IPD
events in major cities, with free food, drawings, door prizes, etc.  I
won a free CPU (worth $200 retail at the time) at the first one I
attended.  I know all of this in detail because I was the technician of
record when the small company I was working for signed up for the IPD
program.  Intel sells every part needed to build a server or workstation
but for the HDD and memory.  If a shop stocks/sells all Intel parts
program points add up more quickly.  In summary, once you're an IPD,
there is disincentive to sell anything else, especially if you're a
small shop.

> for systems like this they get much better warranty support for Intel
> compared to almost any other brand, so I generally end up with Intel
> boards and CPU's for "important" servers... Always heard good things
> about Intel NIC's ...

So you're at the mercy of your supplier.

FYI.  Intel has a tighter relationship with SuperMicro than with any other
mobo manufacturer.  For well over a decade Intel has tapped SM to build all
Intel prototype boards as Intel doesn't have a prototyping facility.
And SM contract manufactures over 50% of all Intel motherboards.  Mobo
mf'ing is a low margin business compared to chip fab'ing, which is why
Intel never built up a large mobo mf'ing capability.  The capital cost
for the robots used in mobo making is as high as CPU building equipment,
but the profit per unit is much lower.

This relationship is the reason Intel was so upset when SM started
offering AMD server boards some years ago, and why at that time one had
to know the exact web server subdir in which to find the AMD
products--SM was hiding them for fear of Chipzilla's wrath.  Intel's
last $1.25B antitrust loss to AMD in '09 emboldened SM to bring their
AMD gear out of hiding and actually promote it.

In short, when you purchase a SuperMicro AMD based server board the
quality and compatibility is just as high as when buying a board with
Intel's sticker on it.  And until Sandy Bridge CPUs hit, you got far
more performance from the AMD solution as well.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-21  0:45                                                           ` Stan Hoeppner
@ 2013-02-21  3:10                                                             ` Adam Goryachev
  2013-02-22 11:19                                                               ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-02-21  3:10 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 21/02/13 11:45, Stan Hoeppner wrote:
> On 2/20/2013 10:45 AM, Adam Goryachev wrote:
>> Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> Same ssd in both tests. fio command line was just fio test.fio
>> The fio file was the one posted in this thread by another user as follows:
>> [global]
>> bs=64k
>> ioengine=libaio
>> iodepth=32
> Try dropping the iodepth to 4 and see what that does.
>
>> size=4g
>> direct=1
>> runtime=60
>> #directory=/dev/vg0/testlv
>> filename=/tmp/testing/test
>>
>> [seq-read]
>> rw=read
>> stonewall
>>
>> [seq-write]
>> rw=write
>> stonewall
>>
>> Note, the "root ssd" is the /tmp/testing/test file, when testing MD
>> performance on the RAID5 I'm using the /dev/vg0/testlv which is an LV on
>> the DRBD on the RAID5 (md2), and I do the test with the DRBD disconnected.
> Yes, and FIO performance to a file is going to be limited by the
> filesystem, and specifically the O_DIRECT implementation in that FS.
> You may see significantly different results from EXT2/3/4, Reiser, than
> from XFS or JFS, and from different kernel and libaio versions as well.
>  There are too many layers between FIO and the block device, so it's
> difficult to get truly accurate performance data for the underlying device.
Not a problem; the root SSD is not really in question here. It is not
relevant to overall system performance; it was just there as a comparative
value...
> And in reality, all of your FIO testing should be random, not
> sequential, as your workload is completely random--this is a block IO
> server after all with 8 client hosts and I assume hundreds of users.
Is there a way to tell FIO to do random read/write tests?
> The proper way to test the capability of your iSCSI target server is to
> fire up 8 concurrent FIO tests, one on each Xen box (or VM), each
> running 8 threads and using random read/write IO, with each hitting a
> different test file residing on a different LUN, while using standard OS
> buffered IO.  Run a timed duration test of say 15 seconds.
At this point, we were trying to test the performance of the RAID5. If
the RAID5 is not performing at expected levels, then testing at a higher
level is not going to improve things. Unfortunately, testing at the LV
level is as "low" in the stack as I can get without wiping the contents...
> Testing raw sequential throughput of a device (single SSD or single LUN
> atop a single LV on a big mdRAID device) is not informative at all.
Except to say that this is the maximum achievable performance we should
expect in ideal conditions. The fact is that one of the actual use cases
is a large streaming read concurrent with a large streaming write. I'd say
a single large streaming write or read (one at a time) is still a relevant
test towards that goal, given we have underlying SSDs and therefore
there are no intervening seeks between each read/write...

You can stop reading here if you are only interested in the question of
performance...

>>> Definitely cheaper, and more flexible should you need to run a filer
>>> (Samba) directly on the box.  Not NEARLY as easy to setup.  Nexsan has
>>> some nice gear that's a breeze to configure, nice intuitive web GUI.
>> The breeze to configure part would be nice :)
> When you're paying that much premium it better come with some good value
> added features.
That is what I had expected from the Overland device, and initially it
was easy to configure/etc... Just obviously a touch buggy; they have
probably fixed those bugs in the newer versions, but I don't think I'll be
going back there again for a while...
>> We have 10Mbps private connection. I think we can license the DRBD proxy
>> which should handle the sync over a slower network. The main issue with
>> DRBD is when you are not using the DRBD proxy.... The connection itself
>> is very reliable though, 
> 10Mbps isn't feasible for 2nd site block level replication, with DRBD
> proxy or otherwise.  It's probably not even feasible for remote file
> based backup.
>
> BTW, what is the business case driver here for off site replication, and
> what is the distance to the replication site?  What is the threat
> profile to the primary infrastructure?  Earthquake?  Tsunami?  Flash
> flooding?  Tornado?  Fire?
It's about 22KM away (13.7 miles)

I suppose it is meant to protect against the threat of fire or theft
at the primary location, and/or other localised events (such as the
exchange burning down, or an extended power outage, etc). It may or may
not be sufficient to protect against more widespread issues such as
earthquake/etc.

The client has an office in most states in Australia, but the offices in
the remote states don't have sufficient bandwidth at this stage for this
to make sense, and there is no local IT knowledge anyway. Considering
that every remote office is dependant on the main office for all their
IT systems, it makes sense to put some sort of attempt at a disaster
recovery plan, also considering the costs of doing this are minimal
(re-use existing spare equipment). Of course, if it works, and works
well, then other things may be done in the future to move this to a site
further away (another state/etc), but initially it is better to have it
relatively close so that issues can be resolved fairly easily.
> I've worked for and done work for businesses of all sizes, the largest
> being Mastercard.  Not a one had offsite replication, and few did
> network based offsite backup.  Those doing offsite other than network
> based performed tape rotation to vault services.
>
> That said, I'm in the US midwest.  And many companies on the East and
> West coasts do replication to facilities here.  Off site
> replication/backup only makes sense when the 2nd facility is immune to
> all natural disasters, and hardened against all types of man made ones.
>  If you replicate to a site in the same city or region with the same
> threat profile, you're pissing in the wind.
You can call it any number of things, including pissing in the wind, but
sometimes it just makes life easier when doing a tender/proposal for a
prospective client to tick the box "do you have a disaster recovery
plan, does it include offsite/remote computer facilities/whatever...". A
lot of these are government or corporate tenders where in reality, it
would never make a difference, but they feel like they need to ask, and
saying no gives a competitor some advantage.
> Off site replication/backup exists for a singular purpose:
>
> To protect against catastrophic loss of the primary facility and the
> data contained therein.
>
> Here in the midwest, datacenters are typically built in building
> basements or annexes and are fireproofed, as fire is the only facility
> threat.  Fireproofing is much more cost effective than the myriad things
> required for site replication and rebuilding a primary site after loss
> due to fire.
The main aim would be to allow recovery from a localised disaster for
all the remote offices: head office might get trashed, but if
the remote offices can continue with business as usual, then there is
still an income/etc. If there was a real disaster
(earthquake/flooding/etc) then they are not likely to be doing much
business in the short term, but soon after the recovery, they would want
to be ready to provide services, whether that is by one of the remote
offices handling the workload, etc.

As I've discussed with other clients, especially where they only have a
single office, and all staff live locally, having an inter-state
disaster recovery centre is pretty useless, since with that level of
disaster you will all be dead anyway, so who really cares :) ie, if a
nuclear weapon is detonated in my city, my customers won't be alive to
call me, and I won't be alive to deal with it, and their customers won't
be alive/etc (ie, does your local butcher need offsite
backup/replication...)
>> just a matter of the bandwidth and whether it
>> will be sufficient. I'll test beforehand by using either the switch or
>> linux to configure a slower connection (maybe 7M or something), and see
>> if it will work reasonably.
> I highly recommend you work through the business case for off site
> replication/DR before embarking down this path.
As mentioned, the business case is minimal, which is why the budget for
it is minimal. If it can't be achieved with minimal (basically nil)
expenditure, then it will be delayed for another day. However, the main
benefit of having it will be giving the salespeople something else to talk
about, more than the actual functionality/effectiveness.

At worst, having DRBD simply re-sync each night would provide adequate
advantage/protection.

I think a 10M connection should be capable of re-syncing around 40G of
data per night, and in a day the DRBD only needs to write a max of 20G,
so hopefully this will be feasible. I'm hoping that some of those writes
are caused by me doing testing/etc, so real world writes will be even
less. (Note, on the year to date stats, max needed to write is 442G, but
that would include system migrations/etc). At the end of the day, you
may be right and the 10M is insufficient to get this done, in which case
we will need to make the business case to either upgrade the bandwidth
further (possibly all the way to 100M), or else to forget the idea
entirely. (Sure all this could be helped if we get rid of MS Outlook,
and its 3 to 10 GB pst files, but that is probably just a dream).
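
(Rough arithmetic behind that estimate: 10Mbps is about 1.25MB/s, or roughly
4.5GB per hour, so an 8-10 hour overnight window gives something like 36-45GB
at full line rate before protocol overhead -- hence the ~40G per night figure.)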
>> I've preferred AMD for years, but my supplier always prefers Intel, and
> Of course, they're an IPD - Intel Product Dealer.  They get kickbacks
> and freebies from the IPD program, including free product samples,
> advance prototype products, promotional materials, and, the big one, 50%
> of the cost of print, radio, and television ads that include Intel
> product and the logo/jingle.  You also get tiered pricing depending on
> volume, no matter how small your shop is.  And of course Intel holds IPD
> events in major cities, with free food, drawings, door prizes, etc.  I
> won a free CPU (worth $200 retail at the time) at the first one I
> attended.  I know all of this in detail because I was the technician of
> record when the small company I was working for signed up for the IPD
> program.  Intel sells every part needed to build a server or workstation
> but for the HDD and memory.  If a shop stocks/sells all Intel parts
> program points add up more quickly.  In summary, once you're an IPD,
> there is disincentive to sell anything else, especially if you're a
> small shop.
I didn't know all that, but it was definitely a large part of my
preference. I really dislike corporations that behave badly, and like to
support the better behaved corporations where possible (ie, basic
business sense still plays a part, but I'd rather pay 5% more for AMD
for example).
>> for systems like this they get much better warranty support for Intel
>> compared to almost any other brand, so I generally end up with Intel
>> boards and CPU's for "important" servers... Always heard good things
>> about Intel NIC's ...
> So you're at the mercy of your supplier.
>
> FYI.  Intel has a tighter relationship with SuperMicro than with any other
> mobo manufacturer.  For well over a decade Intel has tapped SM to build all
> Intel prototype boards as Intel doesn't have a prototyping facility.
> And SM contract manufactures over 50% of all Intel motherboards.  Mobo
> mf'ing is a low margin business compared to chip fab'ing, which is why
> Intel never built up a large mobo mf'ing capability.  The capital cost
> for the robots used in mobo making is as high as CPU building equipment,
> but the profit per unit is much lower.
>
> This relationship is the reason Intel was so upset when SM started
> offering AMD server boards some years ago, and why at that time one had
> to know the exact web server subdir in which to find the AMD
> products--SM was hiding them for fear of Chipzilla's wrath.  Intel's
> last $1.25B antitrust loss to AMD in '09 emboldened SM to bring their
> AMD gear out of hiding and actually promote it.
>
> In short, when you purchase a SuperMicro AMD based server board the
> quality and compatibility is just as high as when buying a board with
> Intel's sticker on it.  And until Sandy Bridge CPUs hit, you got far
> more performance from the AMD solution as well.
All very interesting information, thanks for sharing, I'll keep it in
mind on my next system spec.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-20 17:10                 ` Adam Goryachev
@ 2013-02-21  6:04                   ` Stan Hoeppner
  2013-02-21  6:40                     ` Adam Goryachev
  2013-02-21 17:41                     ` RAID performance - new kernel results - 5x SSD RAID5 David Brown
  0 siblings, 2 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-21  6:04 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/20/2013 11:10 AM, Adam Goryachev wrote:
> On 19/02/13 00:20, Stan Hoeppner wrote:
>> On 2/17/2013 3:52 AM, Adam Goryachev wrote:
>>
>>>    READ: io=4096MB, aggrb=2242MB/s, minb=2296MB/s, maxb=2296MB/s,
>>> mint=1827msec, maxt=1827msec
>>
>>>   WRITE: io=4096MB, aggrb=560660KB/s, minb=574116KB/s, maxb=574116KB/s,
>>> mint=7481msec, maxt=7481msec
>>
>> Our read throughput is almost exactly 4x the write throughput.  At the
>> hardware level, single SSD write throughput should only be ~10% lower
>> than read.  Sequential writes w/RAID5 should not cause RMW cycles so
>> that is not in play in these tests.  So why are writes so much slower?
>> Knowing these things, where should we start looking for our performance
>> killing needle in this haystack?
>>
>> We know that the md/RAID5 driver still uses a single write thread in
>> kernel 3.2.35.  And given we're pushing over 500MB/s through md/RAID5 to
>> SSD storage, it's possible that this thread is eating all of one CPU
>> core with both IOs and parity calculations, limiting write throughput.
>> So that's the first place to look.  For your 7 second test run of FIO,
>> we could do some crude instrumenting.  Assuming you have top setup to
>> show individual Cpus (if not hit '1' in interactive mode to get them,
>> then exit), we can grab top output twice a seconds for 10 seconds, in
>> another terminal window.  So we do something like the following, giving
>> 3 seconds to switch windows and launch FIO.  (Or one could do it in  a
>> single window, writing a script to pipe the output of each to a file)
>>
>> ~$ top -b -n 20 -d 0.5 |grep Cpu
>>
>> yields 28 lines of this for 2 cores, 56 lines for 4 cores.
>>
>> Cpu0 : 1.2%us, 0.5%sy, 1.8%ni, 96.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
>> Cpu1 : 1.1%us, 0.5%sy, 2.2%ni, 96.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
>> Cpu0 : 1.9%us, 1.9%sy, 0.0%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>
>> This will give us a good idea of what the cores are doing during the FIO
>> run, as well as interrupt distribution, which CPUs are handling the
>> lower level IO threads, how long we're waiting on the SSDs, etc.  If any
>> core is at 98%+ during the run then md thread starvation is the problem.
> 
> Didn't quite work, I had to run the top command like this:
> top -n20 -d 0.5 | grep Cpu
> Then press 1 after it started, it didn't save the state when running it
> interactively and then exiting.

Simply reading 'man top' tells you that hitting 'w' writes the change.
As you didn't have the per CPU top layout previously, I can only assume
you don't use top very often, if at all.  top is a fantastic diagnostic
tool when used properly.  Learn it, live it, love it. ;)

> Output is as follows:
...
> Cpu0  :  0.0%us, 47.9%sy,  0.0%ni, 50.0%id,  0.0%wa,  0.0%hi,  2.1%si,
> 0.0%st
> Cpu1  :  0.0%us,  2.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu2  :  2.0%us, 35.3%sy,  0.0%ni,  0.0%id, 62.7%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu4  :  0.0%us,  3.8%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st

With HT giving 8 "cpus", plus the line wrapping, it's hard to make heads or
tails of this output.  I see in your header you use Thunderbird 17 as I do.
Did you notice my formatting of top output wasn't wrapped?  To fix the
wrapping, after you paste it into the compose window, select it all,
then click Edit-->Rewrap.  And you get this:

Cpu0 : 1.1%us, 0.5%sy, 1.8%ni, 96.5%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 1.1%us, 0.5%sy, 2.2%ni, 96.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st

instead of this:

Cpu0  :  1.1%us,  0.5%sy,  1.8%ni, 96.5%id,  0.1%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  1.1%us,  0.5%sy,  2.2%ni, 96.1%id,  0.1%wa,  0.0%hi,  0.0%si,
0.0%st

> There was more, very similar figures... apart from the second sample
> above, there was never a single Cpu with close to 0% Idle, and I'm
> assuming the %CPU in wa state is basically "idle" waiting for the disk
> or something else to happen rather than the CPU actually being busy...

We're looking for a pegged CPU, not idle ones.  Most will be idle, or
should be idle, as this is a block IO server.  And yes, %wa means the
CPU is waiting on an IO device.  With 5 very fast SSDs in RAID5, we
shouldn't be seeing much %wa.  And during a sustained streaming write, I
would expect to see one CPU core pegged at 99% for the duration of the
FIO run, or close to it.  This will be the one running the mdraid5 write
thread.  If we see something other than this, such as heavy %wa, that
may mean there's something wrong elsewhere in the system, either
kernel/parm, or hardware.
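
One way to watch that write thread directly, rather than inferring it from
the per-CPU lines -- assuming the array is md2, so the kernel thread shows
up as md2_raid5, and that the sysstat package is installed for pidstat:

# find the md write thread
pgrep -l md2_raid5
# sample its CPU use once a second during the FIO run
pidstat -p $(pgrep md2_raid5) 1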

>> (If you have hyperthreading enabled, reboot and disable it.  It normally
>> decreases thread performance due to scheduling and context switching
>> overhead, among other things.  Not to mention it makes determining
>> actual CPU load more difficult.  In this exercise you'll needlessly have
>> twice as many lines of output to comb through.)
> 
> I'll have to go in after hours to do that. Hopefully over the weekend
> (BIOS setting and no remote KVM)... Can re-supply the results after that
> if you think it will make a difference.

FYI for future Linux server deployments, it's very rare that a server
workload will run better with HT enabled.  In fact they most often
perform quite a bit worse with HT enabled.  The ones that may perform
better are those such as IMAP servers with hundreds or thousands of user
processes, most sitting idle, or blocking on IO.  For a block IO server
with very few active processes, and processes that need all possible CPU
bandwidth for short intervals (mdraid5 write thread), HT reduces CPU
bandwidth due to switching between two hardware threads on one core.

Note that Intel abandoned HT with the 'core' series of CPUs, and
reintroduced it with the Nehalem series.  AMD has never implemented HT
(SMT) in its CPUs.  And if you recall Opterons beat the stuffing out of
Xeons for many, many years.

>> Again, these numbers assume the low write performance is due to thread
>> starvation.
> 
> I don't think it is from my measurements...

It may not be but it's too early to tell.  After we have some readable
output we'll be able to discern more.  It may simply be that you're
re-writing the same small 15GB section of the SSDs, causing massive
garbage collection, which in turn causes serious IO delays.  This is one
of the big downsides to using SSDs as SAN storage and carving it up into
small chunks.  The more you write large amounts to small sections, the
more GC kicks in to do wear leveling.  With rust you can overwrite the
same section of a platter all day long and the performance doesn't change.

>> The downside for both:  Neither of these configurations can be expanded
>> with a reshape and thus drives cannot be added.  That can be achieved by
>> using a linear layer atop these RAID0 devices, and adding new md devices
>> to the linear array later.  With this you don't get automatic even
>> distribution of IO for the linear array, but only for the constituent
>> striped arrays.  This isn't a bad tradeoff when IO flow analysis and
>> architectural planning are performed before a system is deployed.
> 
> I'll disable the hyperthreading, and re-test afterwards, but I'm not
> sure that will produce much of a result. 

Whatever the resulting data, it should help point us to the cause of the
write performance problem, whether it's CPU starvation of the md write
thread, or something else such as high IO latency due to something like
I described above, or something else entirely, maybe the FIO testing
itself.  We know from other peoples' published results that these Intel
520s SSDs are capable of seq write performance of 500MB/s with a queue
depth greater than 2.  You're achieving full read bandwidth, but only
1/3rd the write bandwidth.  Work with me and we'll get it figured out.

> Let me know if you think I
> should run any other tests to track it down...

Can't think of any at this point.  Any further testing will depend on
the results of good top output from the next FIO run.  Were you able to
get all the SSD partitions starting at a sector evenly divisible by 512
bytes yet?  That may be of more benefit than any other change.  Other
than testing on something larger than a 15GB LV.

> One thing I can see is a large number of interrupts and context switches
> which look like they happened at the same time as a backup run. Perhaps I
> am getting too many interrupts on the network cards or the SATA controller?

If cpu0 isn't peaked your interrupt load isn't too high.  Regardless,
installing irqbalance is a good idea for a multicore iSCSI server with 2
quad port NICs and a high IOPS SAS controller with SSDs attached.  This
system is the poster boy for irqbalance.  As the name implies, the
irqbalance daemon spreads the interrupt load across many cores.  Intel
systems by default route all interrupts to core0.  The 0.56 version in
Squeeze I believe does static IRQ routing, each device's (HBA)
interrupts are routed to a specific core based on discovery.  So, say,
LSI routes to core1, NIC1 to core2, NIC2 to core 3.  So you won't get an
even spread, but at least core0 is no longer handling the entire
interrupt load.  Wheezy ships with 1.0.3 which does dynamic routing, so
on heavily loaded systems (this one is actually not) the spread is much
more even.
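
For reference, the before/after spread is easy to eyeball, and an individual
IRQ can still be pinned by hand if needed -- IRQ 45 and the CPU mask below
are made-up example values:

# one column per CPU, one row per interrupt source
cat /proc/interrupts
# pin IRQ 45 to core 2 (mask is hex: 0x4 = CPU2)
echo 4 > /proc/irq/45/smp_affinity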

WRT context switches, you'll notice this drop substantially after
disabling HT.  And if you think this value is high, compare it to one of
the Terminal Services Xen boxen.  Busy hypervisors and terminal servers
generate the most CS/s of any platform, by far, and you've got both on a
single box.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-21  6:04                   ` Stan Hoeppner
@ 2013-02-21  6:40                     ` Adam Goryachev
  2013-02-21  8:47                       ` Joseph Glanville
  2013-02-22  8:10                       ` Stan Hoeppner
  2013-02-21 17:41                     ` RAID performance - new kernel results - 5x SSD RAID5 David Brown
  1 sibling, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-02-21  6:40 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

On 21/02/13 17:04, Stan Hoeppner wrote:
> Simply reading 'man top' tells you that hitting 'w' writes the change.
> As you didn't have the per CPU top layout previously, I can only assume
> you don't use top very often, if at all.  top is a fantastic diagnostic
> tool when used properly.  Learn it, live it, love it. ;)

haha, yes, I do use top a lot, but I guess I've never learned it very
well. Everything I know about linux has been self-learned, and I guess
until I have a problem, or a need, then I don't tend to learn about it.
I've mostly worked for ISP's as linux sysadmin for the past 16 years or
so....

>> Output is as follows:
> With HT giving 8 "cpus", plus the line wrapping, it's hard to make heads or
> tails of this output.  I see in your header you use Thunderbird 17 as I do.
> Did you notice my formatting of top output wasn't wrapped?  To fix the
> wrapping, after you paste it into the compose window, select it all,
> then click Edit-->Rewrap.  And you get this:

Funny, I never thought to use that feature like that. For me, I only
ever used it to help line wrap really long lines that were quoted from
someone else's email. Didn't know it could make my lines longer (without
manually adjusting the global linewrap character count). Thanks for
another useful tip :)

I'll repost numbers after I disable HT, no point right now.

> We're looking for a pegged CPU, not idle ones.  Most will be idle, or
> should be idle, as this is a block IO server.  And yes, %wa means the
> CPU is waiting on an IO device.  With 5 very fast SSDs in RAID5, we
> shouldn't be seeing much %wa.  And during a sustained streaming write, I
> would expect to see one CPU core pegged at 99% for the duration of the
> FIO run, or close to it.  This will be the one running the mdraid5 write
> thread.  If we see something other than this, such as heavy %wa, that
> may mean there's something wrong elsewhere in the system, either
> kernel/parm, or hardware.

Yes, I'm quite sure that there was no CPU with close to 0% idle (or
100%sy) for the duration of the test. In any case, I'll re-run the test
and advise in a few days.

> FYI for future Linux server deployments, it's very rare that a server
> workload will run better with HT enabled.  In fact they most often
> perform quite a bit worse with HT enabled.  The ones that may perform
> better are those such as IMAP servers with hundreds or thousands of user
> processes, most sitting idle, or blocking on IO.  For a block IO server
> with very few active processes, and processes that need all possible CPU
> bandwidth for short intervals (mdraid5 write thread), HT reduces CPU
> bandwidth due to switching between two hardware threads on one core.
> 
> Note that Intel abandoned HT with the 'core' series of CPUs, and
> reintroduced it with the Nehalem series.  AMD has never implemented HT
>> (SMT) in its CPUs.  And if you recall Opterons beat the stuffing out of
> Xeons for many, many years.

Yes, and I truly loved telling customers that AMD CPU's were both
cheaper AND better performing. Those were amazing days for AMD. To be
honest, I don't read enough about CPU's anymore, but my understanding is
that AMD are a little behind on the performance curve, but not far
enough that I wouldn't want to use them....

>> I don't think it is from my measurements...
> 
> It may not be but it's too early to tell.  After we have some readable
> output we'll be able to discern more.  It may simply be that you're
> re-writing the same small 15GB section of the SSDs, causing massive
> garbage collection, which in turn causes serious IO delays.  This is one
> of the big downsides to using SSDs as SAN storage and carving it up into
> small chunks.  The more you write large amounts to small sections, the
> more GC kicks in to do wear leveling.  With rust you can overwrite the
> same section of a platter all day long and the performance doesn't change.

True, I can allocate a larger LV for testing (I think I have around 500G
free at the moment, just let me know what size I should allocate/etc...)

> Whatever the resulting data, it should help point us to the cause of the
> write performance problem, whether it's CPU starvation of the md write
> thread, or something else such as high IO latency due to something like
> I described above, or something else entirely, maybe the FIO testing
> itself.  We know from other peoples' published results that these Intel
> 520s SSDs are capable of seq write performance of 500MB/s with a queue
> depth greater than 2.  You're achieving full read bandwidth, but only
> 1/3rd the write bandwidth.  Work with me and we'll get it figured out.

Sounds good, thanks.

>> Let me know if you think I
>> should run any other tests to track it down...
> 
> Can't think of any at this point.  Any further testing will depend on
> the results of good top output from the next FIO run.  Were you able to
> get all the SSD partitions starting at a sector evenly divisible by 512
> bytes yet?  That may be of more benefit than any other change.  Other
> than testing on something larger than a 15GB LV.

All drives now look like this (fdisk -ul)
Disk /dev/sdb: 480 GB, 480101368320 bytes
255 heads, 63 sectors/track, 58369 cylinders, total 937697985 sectors
Units = sectors of 1 * 512 = 512 bytes

Device Boot      Start        End     Blocks  Id System
/dev/sdb1           64  931770000  465893001  fd Lnx RAID auto
Warning: Partition 1 does not end on cylinder boundary.

I think (from the list) that this should now be correct...
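
The quick arithmetic behind that check, for what it's worth (4KiB is used as
the boundary here; substitute the erase block size if you want to check
against that instead):

# 64 sectors * 512 bytes = 32KiB offset from the start of the disk
echo $((64 * 512))
# 0 means the partition start is at least 4KiB aligned
echo $((64 * 512 % 4096))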

>> One thing I can see is a large number of interrupts and context switches
>> which look like they happened at the same time as a backup run. Perhaps I
>> am getting too many interrupts on the network cards or the SATA controller?
> 
> If cpu0 isn't peaked your interrupt load isn't too high.  Regardless,
> installing irqbalance is a good idea for a multicore iSCSI server with 2
> quad port NICs and a high IOPS SAS controller with SSDs attached.  This
> system is the poster boy for irqbalance.  As the name implies, the
> irqbalance daemon spreads the interrupt load across many cores.  Intel
> systems by default route all interrupts to core0.  The 0.56 version in
> Squeeze I believe does static IRQ routing, each device's (HBA)
> interrupts are routed to a specific core based on discovery.  So, say,
> LSI routes to core1, NIC1 to core2, NIC2 to core 3.  So you won't get an
> even spread, but at least core0 is no longer handling the entire
> interrupt load.  Wheezy ships with 1.0.3 which does dynamic routing, so
> on heavily loaded systems (this one is actually not) the spread is much
> more even.

OK, currently all IRQ's are on CPU0 (/proc/interrupts). I've installed
irqbalance, and it has already started to spread interrupts across the
CPU's. I am pretty sure I started doing some irq balancing a few months
ago, but I was doing it manually, and set the onboard SATA to one CPU,
each pair of ethernet ports to another, and everything else to the last.
I tried to skip the HT CPU's. I think this is going to be a better
solution, especially once I disable HT.

> WRT context switches, you'll notice this drop substantially after
> disabling HT.  And if you think this value is high, compare it to one of
> the Terminal Services Xen boxen.  Busy hypervisors and terminal servers
> generate the most CS/s of any platform, by far, and you've got both on a
> single box.

Speaking of which, I've found another few issues that are not related to
the RAID write speed, but may be related to the end user experience.

Tonight, I will increase each xen physical box from having 1 CPU pinned,
to having 2 CPU's pinned.
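
(The mechanics for that are presumably the usual vcpu pinning -- a sketch
only, the CPU numbers below are made up, and it's xm rather than xl on the
older toolstack:

# pin dom0's two vcpus to two physical cores, then check the layout
xl vcpu-pin Domain-0 0 0
xl vcpu-pin Domain-0 1 1
xl vcpu-list
)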

The Domain Controller/file server (windows 2000) is configured for 2
vCPU, but is only using one since windows itself is not setup for
multiple CPU's. I'll change the windows driver and in theory this should
allow dual CPU support.

Generally speaking, complaints have settled down, and I think most users
are basically happy. I've still had a few users with "outlook crashing",
and I've now seen that usually the PST file is corrupt. I'm hopeful that
running the scanpst tool will fix the corruptions and stop the outlook
crashes. In addition, I've found the user with the biggest complaints
about performance has a 9GB pst file, so a little pruning will improve
that I suspect.

So, I think between the above couple of things, and all the other work
already done, the customer is relatively comfortable (I won't say happy,
but maybe if we can survive a few weeks without any disaster...).
Personally, I'd like to improve the RAID performance, just because it
should, but at least I can relax a little, and dedicate some time to
other jobs, etc...

So, summary:
1) Disable HT
2) Increase test LV to 100G
3) Re-run fio test
4) Re-collect CPU stats
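
For step 2, assuming vg0 still has the ~500G free mentioned above, something
like one of these should do (the new LV name is just an example):

# grow the existing test LV
lvextend -L 100G /dev/vg0/testlv
# or carve out a fresh one for the test
lvcreate -L 100G -n testlv100 vg0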

Sound good?

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-21  6:40                     ` Adam Goryachev
@ 2013-02-21  8:47                       ` Joseph Glanville
  2013-02-22  8:10                       ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Joseph Glanville @ 2013-02-21  8:47 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: stan, Dave Cundiff, linux-raid

On 21 February 2013 17:40, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> On 21/02/13 17:04, Stan Hoeppner wrote:
>> Simply reading 'man top' tells you that hitting 'w' writes the change.
>> As you didn't have the per CPU top layout previously, I can only assume
>> you don't use top very often, if at all.  top is a fantastic diagnostic
>> tool when used properly.  Learn it, live it, love it. ;)
>
> haha, yes, I do use top a lot, but I guess I've never learned it very
> well. Everything I know about linux has been self-learned, and I guess
> until I have a problem, or a need, then I don't tend to learn about it.
> I've mostly worked for ISP's as linux sysadmin for the past 16 years or
> so....
>
>>> Output is as follows:
>> With HT, this output for 8 "cpus" and line wrapping, it's hard to make
>> heads/tails.  I see in your header you use Thundebird 17 as I do.  Did
>> you notice my formatting of top output wasn't wrapped?  To fix the
>> wrapping, after you paste it into the compose windows, select it all,
>> then click Edit-->Rewrap.  And you get this:
>
> Funny, I never thought to use that feature like that. For me, I only
> ever used it to help line wrap really long lines that were quoted from
> someone else email. Didn't know it could make my lines longer (without
> manually adjusting the global linewrap character count). Thanks for
> another useful tip :)
>
> I'll repost numbers after I disable HT, no point right now.
>
>> We're looking for a pegged CPU, not idle ones.  Most will be idle, or
>> should be idle, as this is a block IO server.  And yes, %wa means the
>> CPU is waiting on an IO device.  With 5 very fast SSDs in RAID5, we
>> shouldn't be seeing much %wa.  And during a sustained streaming write, I
>> would expect to see one CPU core pegged at 99% for the duration of the
>> FIO run, or close to it.  This will be the one running the mdraid5 write
>> thread.  If we see something other than this, such as heavy %wa, that
>> may mean there's something wrong elsewhere in the system, either
>> kernel/parm, or hardware.
>
> Yes, I'm quite sure that there was no CPU with close to 0% idle (or
> 100%sy) for the duration of the test. In any case, I'll re-run the test
> and advise in a few days.
>
>> FYI for future Linux server deployments, it's very rare that a server
>> workload will run better with HT enabled.  In fact they most often
>> perform quite a bit worse with HT enabled.  The ones that may perform
>> better are those such as IMAP servers with hundreds or thousands of user
>> processes, most sitting idle, or blocking on IO.  For a block IO server
>> with very few active processes, and processes that need all possible CPU
>> bandwidth for short intervals (mdraid5 write thread), HT reduces CPU
>> bandwidth due to switching between two hardware threads on one core.
>>
>> Note that Intel abandoned HT with the 'core' series of CPUs, and
>> reintroduced it with the Nehalem series.  AMD has never implemented HT
>> (SMT) it its CPUs.  And if you recall Opterons beat the stuffing out of
>> Xeons for many, many years.
>
> Yes, and I truly loved telling customers that AMD CPU's were both
> cheaper AND better performing. Those were amazing days for AMD. To be
> honest, I don't read enough about CPU's anymore, but my understanding is
> that AMD are a little behind on the performance curve, but not far
> enough that I wouldn't want to use them....
>
>>> I don't think it is from my measurements...
>>
>> It may not be but it's too early to tell.  After we have some readable
>> output we'll be able to discern more.  It may simply be that you're
>> re-writing the same small 15GB section of the SSDs, causing massive
>> garbage collection, which in turn causes serious IO delays.  This is one
>> of the big downsides to using SSDs as SAN storage and carving it up into
>> small chunks.  The more you write large amounts to small sections, the
>> more GC kicks in to do wear leveling.  With rust you can overwrite the
>> same section of a platter all day long and the performance doesn't change.
>
> True, I can allocate a larger LV for testing (I think I have around 500G
> free at the moment, just let me know what size I should allocate/etc...)
>
>> Whatever the resulting data, it should help point us to the cause of the
>> write performance problem, whether it's CPU starvation of the md write
>> thread, or something else such as high IO latency due to something like
>> I described above, or something else entirely, maybe the FIO testing
>> itself.  We know from other peoples' published results that these Intel
>> 520s SSDs are capable of seq write performance of 500MB/s with a queue
>> depth greater than 2.  You're achieving full read bandwidth, but only
>> 1/3rd the write bandwidth.  Work with me and we'll get it figured out.
>
> Sounds good, thanks.
>
>>> Let me know if you think I
>>> should run any other tests to track it down...
>>
>> Can't think of any at this point.  Any further testing will depend on
>> the results of good top output from the next FIO run.  Were you able to
>> get all the SSD partitions starting at a sector evenly divisible by 512
>> bytes yet?  That may be of more benefit than any other change.  Other
>> than testing on something larger than a 15GB LV.
>
> All drives now look like this (fdisk -ul)
> Disk /dev/sdb: 480 GB, 480101368320 bytes
> 255 heads, 63 sectors/track, 58369 cylinders, total 937697985 sectors
> Units = sectors of 1 * 512 = 512 bytes
>
> Device Boot Start End Blocks Id System
> /dev/sdb1 64 931770000 465893001 fd Lnx RAID auto
> Warning: Partition 1 does not end on cylinder boundary.
>
> I think (from the list) that this should now be correct...
>
>>> One thing I can see is a large number of interrupts and context switches
>>> which looks like it happened at the same time as a backup run. Perhaps I
> am getting too many interrupts on the network cards or the SATA controller?
>>
>> If cpu0 isn't peaked your interrupt load isn't too high.  Regardless,
>> installing irqbalance is a good idea for a multicore iSCSI server with 2
>> quad port NICs and a high IOPS SAS controller with SSDs attached.  This
>> system is the poster boy for irqbalance.  As the name implies, the
>> irqbalance daemon spreads the interrupt load across many cores.  Intel
>> systems by default route all interrupts to core0.  The 0.56 version in
>> Squeeze I believe does static IRQ routing, each device's (HBA)
>> interrupts are routed to a specific core based on discovery.  So, say,
>> LSI routes to core1, NIC1 to core2, NIC2 to core 3.  So you won't get an
>> even spread, but at least core0 is no longer handling the entire
>> interrupt load.  Wheezy ships with 1.0.3 which does dynamic routing, so
>> on heavily loaded systems (this one is actually not) the spread is much
>> more even.
>
> OK, currently all IRQ's are on CPU0 (/proc/interrupts). I've installed
> irqbalance, and it has already started to spread interrupts across the
> CPU's. I am pretty sure I started doing some irq balancing a few months
> ago, but I was doing it manually, and set the onboard SATA to one CPU,
> each pair of ethernet ports to another, and everything else to the last.
> I tried to skip the HT CPU's. I think this is going to be a better
> solution, especially once I disable HT.
>
>> WRT context switches, you'll notice this drop substantially after
>> disabling HT.  And if you think this value is high, compare it to one of
>> the Terminal Services Xen boxen.  Busy hypervisors and terminal servers
>> generate the most CS/s of any platform, by far, and you've got both on a
>> single box.
>
> Speaking of which, I've found another few issues that are not related to
> the RAID write speed, but may be related to the end user experience.
>
> Tonight, I will increase each xen physical box from having 1 CPU pinned,
> to having 2 CPU's pinned.
>
> The Domain Controller/file server (windows 2000) is configured for 2
> vCPU, but is only using one since windows itself is not setup for
> multiple CPU's. I'll change the windows driver and in theory this should
> allow dual CPU support.
>
> Generally speaking, complaints have settled down, and I think most users
> are basically happy. I've still had a few users with "outlook crashing",
> and I've now seen that usually the PST file is corrupt. I'm hopeful that
> running the scanpst tool will fix the corruptions and stop the outlook
> crashes. In addition, I've found the user with the biggest complaints
> about performance has a 9GB pst file, so a little pruning will improve
> that I suspect.
>
> So, I think between the above couple of things, and all the other work
> already done, the customer is relatively comfortable (I won't say happy,
> but maybe if we can survive a few weeks without any disaster...).
> Personally, I'd like to improve the RAID performance, just because it
> should, but at least I can relax a little, and dedicate some time to
> other jobs, etc...
>
> So, summary:
> 1) Disable HT
> 2) Increase test LV to 100G
> 3) Re-run fio test
> 4) Re-collect CPU stats
>
> Sound good?
>
> Thanks,
> Adam
>
> --
> Adam Goryachev
> Website Managers
> www.websitemanagers.com.au
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sorry to butt in, but have you tried doing the tests beneath the DRBD layer?
DRBD is known for doing interesting things to IOs and could be what
is now limiting performance.

I found when building fast SRP based SANs that using DRBD for
replication (even when not connected) dropped performance to less than
20% of what the array is capable of.
This may have changed since - I am talking a few years ago now when
DRBD was first merged into mainline.

It is safe to do reads on the raw md device; as long as you don't have
fio configured to do writes, you won't hurt anything.
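
If it helps, a read-only fio job against the raw md device could look
roughly like this (the md name and sizes are assumptions, adjust to
suit), and running it as "fio --readonly raw-read.fio" makes fio itself
refuse to issue any writes:

[global]
filename=/dev/md1
blocksize=256k
ioengine=libaio
iodepth=16
numjobs=16
thread
group_reporting
direct=1
size=8g

[rawread]
rw=randread

Comparing that against the same job run on top of the DRBD device
should show what the DRBD layer is costing you.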

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-21  6:04                   ` Stan Hoeppner
  2013-02-21  6:40                     ` Adam Goryachev
@ 2013-02-21 17:41                     ` David Brown
  2013-02-23  6:41                       ` Stan Hoeppner
  1 sibling, 1 reply; 131+ messages in thread
From: David Brown @ 2013-02-21 17:41 UTC (permalink / raw)
  To: stan; +Cc: Adam Goryachev, Dave Cundiff, linux-raid

On 21/02/13 07:04, Stan Hoeppner wrote:
> FYI for future Linux server deployments, it's very rare that a server
> workload will run better with HT enabled.  In fact they most often
> perform quite a bit worse with HT enabled.  The ones that may perform
> better are those such as IMAP servers with hundreds or thousands of user
> processes, most sitting idle, or blocking on IO.  For a block IO server
> with very few active processes, and processes that need all possible CPU
> bandwidth for short intervals (mdraid5 write thread), HT reduces CPU
> bandwidth due to switching between two hardware threads on one core.
>
> Note that Intel abandoned HT with the 'core' series of CPUs, and
> reintroduced it with the Nehalem series.  AMD has never implemented HT
>> (SMT) in its CPUs.  And if you recall Opterons beat the stuffing out of
> Xeons for many, many years.
>

It is worth noting here that there are very different implementations of 
HT.  Intel's first HT processors had severe problems with the cost of 
context switches, and most loads performed better with HT disabled.  But 
these days are long gone - the current HT processors are much more 
effective.  On loads where there is a fair amount of processing going 
on, then HT can help significantly.  If you are IO or memory bound, of 
course, then HT will not help at all - and even the modern cheaper 
context switches are not free, and may reduce the overall performance.

Also remember that when HT was first introduced in x86 cpus, OS's (Linux 
and other unmentionable OS's) were not optimised for them - they treated 
the fake cores like real ones.  These days Linux makes a distinction and 
uses the fake cores appropriately.

HT may not be of help in a pure file server setup, but in many other 
server applications such as web servers (and IMAP, as you mentioned), HT 
is a huge benefit.  It is no coincidence that the big server cpu 
architectures (MIPS, Power, SPARC) all use 2-way or even 4-way SMT.

I have no numbers of my own to back this up - but I would certainly not 
consider disabling HT on a server without very concrete reasoning.

(I too was a great fan of AMD, and used them almost exclusively until 
the Core 2 architecture from Intel.  And while I am glad that Intel have 
made very nice chips in recent years, I think it is a shame that they 
did so using ideas copied directly from AMD - and AMD can no longer 
seriously compete.)


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-21  6:40                     ` Adam Goryachev
  2013-02-21  8:47                       ` Joseph Glanville
@ 2013-02-22  8:10                       ` Stan Hoeppner
  2013-02-24 20:36                         ` Stan Hoeppner
  1 sibling, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-22  8:10 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/21/2013 12:40 AM, Adam Goryachev wrote:
...
> True, I can allocate a larger LV for testing (I think I have around 500G
> free at the moment, just let me know what size I should allocate/etc...)

Before you change your test LV size, do the following:

1.  Make sure stripe_cache_size is as least 8192.  If not:
    ~$ echo 8192 > /sys/block/md0/md/stripe_cache_size
    To make this permanent, add the line to /etc/rc.local

2.  Run fio using this config file and post the results:

[global]
filename=/dev/vg0/testlv (assuming this is still correct)
zero_buffers
numjobs=16
thread
group_reporting
blocksize=256k
ioengine=libaio
iodepth=16
direct=1
size=8g

[read]
rw=randread
stonewall

[write]
rw=randwrite
stonewall

...
> Device Boot Start End Blocks Id System
> /dev/sdb1 64 931770000 465893001 fd Lnx RAID auto
> Warning: Partition 1 does not end on cylinder boundary.
> 
> I think (from the list) that this should now be correct...

Start sector is 64.  That should do it I think.

...
> Tonight, I will increase each xen physical box from having 1 CPU pinned,
> to having 2 CPU's pinned.

I'm not familiar with Xen "pinning".  Do you mean you had less than 6
cores available to each Windows TS VM?  Given that running TS/Citrix
inside a VM is against every BCP due to context switching overhead, you
should make all 6 cores available to each TS VM all the time, if Xen
allows it.  Otherwise you're perennially wasting core cycles that would
benefit user sessions, which could be making everything faster, more
responsive, for everyone.

> The Domain Controller/file server (windows 2000) is configured for 2
> vCPU, but is only using one since windows itself is not setup for
> multiple CPU's. I'll change the windows driver and in theory this should
> allow dual CPU support.

It probably won't make much, if any, difference for this VM.  But if the
box has 6 cores and only one is actually being used, it certainly can't
hurt.

> Generally speaking, complaints have settled down, and I think most users
> are basically happy. I've still had a few users with "outlook crashing",
> and I've now seen that usually the PST file is corrupt. I'm hopeful that

Their .PST files reside on a share on the DC, correct?  And one is 9GB
in size?  I had something really humorous typed in here, but on second
read it was a bit... unprofessional. ;)  It involved padlocks on doors
and a dozen angry wild roos let loose in the office.

> running the scanpst tool will fix the corruptions and stop the outlook
> crashes. In addition, I've found the user with the biggest complaints
> about performance has a 9GB pst file, so a little pruning will improve
> that I suspect.

One effective way to protect stupid users from themselves is mailbox
quotas.  There are none when the MUA owns its mailbox file.  You could
implement NTFS quotas on the user home directory.  Not sure how Outlook
would, or could, handle a disk quota error.  Probably not something MS
programmers would have considered, as they have that Exchange groupware
product they'd rather sell you.

Sounds like it's time to switch them to a local IMAP server such as
Dovecot.  Simple to install in a Debian VM.  Probably not so simple to
get the users to migrate their mail to it.
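
On the Debian side it really is about this much (package name per
Squeeze/Wheezy; migrating the mail out of the PSTs is the hard part):

~$ apt-get install dovecot-imapd

then point mail_location in the Dovecot config at a maildir location
and create the users.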

> So, I think between the above couple of things, and all the other work
> already done, the customer is relatively comfortable (I won't say happy,
> but maybe if we can survive a few weeks without any disaster...).

I hear ya on that.

> Personally, I'd like to improve the RAID performance, just because it
> should, but at least I can relax a little, and dedicate some time to
> other jobs, etc...

I'm not convinced at this point you don't already have it.  You're
basing that assumption on a single disk tester program, and you're not
even running the correct set of tests.  Those above may prove more telling.

> So, summary:
> 1) Disable HT
> 2) Increase test LV to 100G
> 3) Re-run fio test
> 4) Re-collect CPU stats

  5) Get all cores to TS VMs

> Sound good?

Yep.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-21  3:10                                                             ` Adam Goryachev
@ 2013-02-22 11:19                                                               ` Stan Hoeppner
  2013-02-22 15:25                                                                 ` Charles Polisher
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-22 11:19 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/20/2013 9:10 PM, Adam Goryachev wrote:
> On 21/02/13 11:45, Stan Hoeppner wrote:

> Not a problem, the root SSD is not really in question here, it is not
> relevant to the system performance overall, it was just as a comparative
> value...

Yes, I was simply explaining one reason the results on a per drive basis
would be a little lower for the root drive.

> Is there a way to tell FIO to do random read/write tests?

Yes.  The job file in my previous email is exclusively random tests.  It
also contains things that should give results that reflect more closely
your actual workloads.

...
> You can call it any number of things, including pissing in the wind, but
> sometimes it just makes life easier when doing a tender/proposal for a
> prospective client to tick the box "do you have a disaster recovery
> plan, does it include offsite/remote computer facilities/whatever... A
> lot of these are government or corporate tenders where in reality, it
> would never make a difference, but they feel like they need to ask, and
> saying no gives a competitor some advantage.

...
> The main aim would be to allow recovery from a localised disaster for
> all the remote offices, while head office might get trashed, at least if
> the remote offices can continue with business as usual, then there is
> still an income/etc. If there was a real disaster
> (earthquake/flooding/etc) then they are not likely to be doing much
> business in the short term, but soon after the recovery, they would want
> to be ready to provide services, whether that is by one of the remote
> offices handling the workload, etc.

If you're doing real DAR planning, you're going to need more than just a
block IO mirror at the remote site, obviously.  You also need a plan in
place to get people connected with the data on it, get them working
again, which means workstations and office space, and have it all lined
up and in place before disaster strikes.  It's literally just like
football, tennis, etc.  You have to practice before the big match.

I've developed DAR plans for two former employers, implemented one.  One
decided it was too costly.  The other gave me budget for leasing office
space for a skeleton crew, not just IT but a few people from each dept.
 We had complete duplicate servers and storage, workstations, switches,
router, all pre-configured, phones, etc.  We had a DAR clause added to
our Bell contract and lines in place for phone trunk and a T1 for net.
Upon activation of our DAR plan, Bell would switch our circuits over
within 4 hours of the call, and we could pick up our tapes from the vault
immediately to start our restore from the previous night.  We performed
a test run sans the phone cutover, worked out some kinks, and everything
worked on the second go, up in 24 hours working straight through.

That's how DAR is done properly.  Or at least, one method of many.  This
is called a "warm site".  A cold site would be arranged but unallocated
office space with equipment staged in storage, no circuit arrangement.
Hot site is self evident.

> As I've discussed with other clients, especially where they only have a
> single office, and all staff live locally, having an inter-state
> disaster recovery centre is pretty useless, since with that level of
> disaster you will all be dead anyway, so who really cares :) ie, if a
> nuclear weapon is detonated in my city, my customers won't be alive to
> call me, and I won't be alive to deal with it, and their customers won't
> be alive/etc (ie, does your local butcher need offsite
> backup/replication...)

Nobody doing DAR considers nuclear war.  Earthquake, floods, tsunamis,
tornadoes, terrorism, design flaw structural failure, stuff like that.
Which are survivable, but can leave a facility in shambles, companies in
ruin, and people out of work.

> As mentioned, the business case is minimal, which is why the budget for
> it is minimal. If it can't be achieved with minimal (basically nil)
> expenditure, then it will be delayed for another day. However, having it
> will mostly benefit by adding something else for the salespeople to talk
> about more than the actual functionality/effectiveness.

Hell, if the issue is sales, they're lying if their mouth is open.  Just
tell clients you have it, regardless. ;)

> At worst, having DRBD simply re-sync each night would provide adequate
> advantage/protection.

Of that data.  As I stated above, people need to be able to get to it
and get working afterward.  That's part of the "business case" I was
referring to.

...
> I didn't know all that, but it was definitely a large part of my
> preference. I really dislike corporations that behave badly, and like to
> support the better behaved corporations where possible (ie, basic
> business sense still plays a part, but I'd rather pay 5% more for AMD
> for example).

AMD has already announced their intention to de-emphasize their x86
processor business and focus on the mobile market.  This was a result of
the Bulldozer CPUs performing lower than expected, lower than the
previous generation in many applications, and falling even further
behind Intel in performance.  And a huge loss that quarter.  I hope they
reverse course here...

> All very interesting information, thanks for sharing, I'll keep it in
> mind on my next system spec.

Not only does SuperMicro offer far more Intel server boards than Intel,
they also offer more AMD server boards than anyone.  Here's an
inexpensive server board for Opteron 3000 (AM3+) that has an integrated
8 port LSI 2308 (9207--same as yours):
http://www.supermicro.com/Aplus/motherboard/Opteron3000/SR56x0/H8SML-7.cfm

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-22 11:19                                                               ` Stan Hoeppner
@ 2013-02-22 15:25                                                                 ` Charles Polisher
  2013-02-23  4:14                                                                   ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Charles Polisher @ 2013-02-22 15:25 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: linux-raid

Stan Hoeppner wrote:
> Not only does SuperMicro offer far more Intel server boards than Intel,
> they also offer more AMD server boards than anyone.  Here's an
> inexpensive server board for Opteron 3000 (AM3+) that has an integrated
> 8 port LSI 2308 (9207--same as yours):
> http://www.supermicro.com/Aplus/motherboard/Opteron3000/SR56x0/H8SML-7.cfm

Following this thread avidly, it has made me late for work twice this week.

With IPMI added it seems to retail for same price (Amazon USD$399):
http://www.supermicro.com/aplus/motherboard/opteron3000/sr56x0/h8sml-7f.cfm

-- 
Charles



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance
  2013-02-22 15:25                                                                 ` Charles Polisher
@ 2013-02-23  4:14                                                                   ` Stan Hoeppner
  0 siblings, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-23  4:14 UTC (permalink / raw)
  To: Charles Polisher; +Cc: linux-raid

On 2/22/2013 9:25 AM, Charles Polisher wrote:
> Stan Hoeppner wrote:
>> Not only does SuperMicro offer far more Intel server boards than Intel,
>> they also offer more AMD server boards than anyone.  Here's an
>> inexpensive server board for Opteron 3000 (AM3+) that has an integrated
>> 8 port LSI 2308 (9207--same as yours):
>> http://www.supermicro.com/Aplus/motherboard/Opteron3000/SR56x0/H8SML-7.cfm
> 
> Following this thread avidly, it has made me late for work twice this week.

Heheh, it's not *that* interesting. ;)

> With IPMI added it seems to retail for same price (Amazon USD$399):
> http://www.supermicro.com/aplus/motherboard/opteron3000/sr56x0/h8sml-7f.cfm

The H8SML above is their bottom rung AMD server board.  The top rung is
this one:
http://www.supermicro.com/Aplus/motherboard/Opteron6000/SR56x0/H8QG7-LN4F.cfm

Quad socket G34, up to 64 cores and 1TB RAM, 16 DDR3 channels up to
192GB/s, on-board LSI MegaRAID 9265-8i w/1GB cache, quad Intel GbE, two
PCIe 2.0 x16 slots and two x8, etc.

Of course any system using this board won't rely on the on-board LSI for
main storage.  It's primarily for boot/OS RAID.

-- 
Stan




^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-21 17:41                     ` RAID performance - new kernel results - 5x SSD RAID5 David Brown
@ 2013-02-23  6:41                       ` Stan Hoeppner
  0 siblings, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-23  6:41 UTC (permalink / raw)
  To: David Brown; +Cc: Adam Goryachev, Dave Cundiff, linux-raid

On 2/21/2013 11:41 AM, David Brown wrote:
...
> HT may not be of help in a pure file server setup, but in many other
> server applications such as web servers (and IMAP, as you mentioned), HT
> is a huge benefit.  

HT benefits workloads with more heavy active processes than cores, which
causes functional unit/pipeline contention.  If your workload has no
resource contention, then HT is of no benefit.  In these cases having it
enabled can actually decrease performance, yes, even with Intel's recent
implementation, and it can cause other issues as I described earlier,
such as simple system administration headaches.

> It is no coincidence that the big server cpu
> architectures (MIPS, Power, SPARC) all use 2-way or even 4-way SMT.

Quoting Wikipedia again...but making incorrect assumptions about what it
says.  MIPS CPUs haven't been used in a "big server" for a decade, the
last machine being the Origin 3900, the CPU being the single core
R16000A, which had no SMT.  HPC workloads don't benefit from it.  The
short lived SiCortex machines were obviously "big" with up to 5832 cores
using 6-way SMP SOCs.  These cores did not have SMT either.  And yes,
the SiCortex section of the MIPS page is my edit, as well as some of the
SGI related edits.  (Hated to see SiCortex fold as their machines not
only offered performance and unique features, but had a cool aesthetic
missing in the supercomputer space since the glory days of Cray)

Imagination Technologies today offers two MIPS IP cores with SMT, of
over hundreds of cores/designs in their portfolio.  Both are used in
embedded applications only.  Both are 32 bit CPUs.  And both hit the
market within the past 2 years.  I.e. SMT is very new for MIPS chips.

WRT Power and SPARC these are targeted at consolidation workloads which
can benefit from SMT.  Note that on the Power CPUs destined for HPC
platforms SMT is typically disabled.

So again, whether HT/SMT is of benefit depends entirely on the workload.
 In Adam's iSCSI server case, it decidedly does not.

> I have no numbers of my own to back this up - but I would certainly not
> consider disabling HT on a server without very concrete reasoning.

I've demonstrated the reasoning, twice now.  People fear what they don't
understand.  You fear shutting off HT because you don't yet have a
complete understanding of how it actually works, and when it actually helps.

> (I too was a great fan of AMD, and used them almost exclusively until
> the Core 2 architecture from Intel.  And while I am glad that Intel have
> made very nice chips in recent years, I think it is a shame that they
> did so using ideas copied directly from AMD - and AMD can no longer
> seriously compete.)

Absolute performance was no longer an issue long before Intel introduced
the Core architecture.  The vast majority of cycles on all systems were
executing the idle instruction, still are.  And there were/are few
server applications deployed that performed significantly better with
Xeon than Opteron.  Thus I continued with AMD for a few reasons.  The IO
infrastructure was superior, and it wasn't until QuickPath that Intel
caught up here.  And AMD still offers a better price/performance ratio.

-- 
Stan




^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-02-17  9:52             ` RAID performance - new kernel results Adam Goryachev
  2013-02-18 13:20               ` RAID performance - new kernel results - 5x SSD RAID5 Stan Hoeppner
@ 2013-02-23 15:57               ` John Stoffel
  2013-03-01 16:10                 ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: John Stoffel @ 2013-02-23 15:57 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid


Adam,

Can I please ask you to sit down and write a paper for USENIX on this
whole issue and how you resolved it?  You and Stan have done a great
job here documenting and discussing the problems, troubleshooting
methods and eventual solution(s) to the problem.  

It would be wonderful to have some diagrams to go with all this
discussion, showing the original network setup, iSCSI disk setup,
etc.  Then how you updated and changed things to find the bottlenecks. 

The interesting thing is the complete slowdown when using LVM
snapshots, which points to major possibilities for performance
improvements there.  But those improvements will be hard to do without
being able to run on real hardware, which is expensive for people to
have at home.  

I've been following this discussion from day one and really enjoying
it and I've learned quite a bit about iSCSI, networking and some of
the RAID issues.  I too run Debian stable on my home NFS/VM/mail/mysql
server and I've been getting frustrated by how far back it is, even
with backports.  I got burned in the past by testing, which is why I
stay on stable, but now I'm feeling like I'm getting burned on stable
too.  *grin*  It's a balancing act for sure!

Thanks,
John

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-22  8:10                       ` Stan Hoeppner
@ 2013-02-24 20:36                         ` Stan Hoeppner
  2013-03-01 16:06                           ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-02-24 20:36 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Dave Cundiff, linux-raid

On 2/22/2013 2:10 AM, Stan Hoeppner wrote:

Revisiting this briefly...

> 2.  Run fio using this job file and post the results:
> 
> [global]
> filename=/dev/vg0/testlv (assuming this is still correct)
> zero_buffers
> numjobs=16
> thread
> group_reporting
> blocksize=256k
> ioengine=libaio
> iodepth=16
> direct=1
> size=8g
> 
> [read]
> rw=randread
> stonewall
> 
> [write]
> rw=randwrite
> stonewall

When you run the fio test above, use this for grabbing top output.  But
first, if you haven't already, run top interactively and enable per CPU
display, save by hitting 'w'.  Then use:

~$ top -b -n 60 -d 0.25|grep Cpu|sort -n > /some.dir/some.file

This will grab metrics 4 times per second instead of 2 as we did
previously.  This will give better resolution.  It will also sort the
output by CPU making it much easier to see the ramp and load trend on
each.  It should have a run time of ~16 seconds, which should be plenty
of time to launch fio and get the entire run in our top data.  I'm not
sure if we were running it long enough previously to capture (all of)
the write job.  It would probably be wise to upload the top output file
to a pastebin and link it, as it will be 240 lines long.

I'm quite anxious to see the random write results from the test above
and the accompanying CPU burn.  The bandwidth numbers should be a bit
surprising to you and others following this thread.  Ok, I'm probably
sandbagging a bit.  They should shock you.

Then, after everyone has retrieved their jaws from the floor I'll
explain why the numbers are so much higher, and why proper test
methodology is critical to benchmarking.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-02-24 20:36                         ` Stan Hoeppner
@ 2013-03-01 16:06                           ` Adam Goryachev
  2013-03-02  9:15                             ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-03-01 16:06 UTC (permalink / raw)
  To: stan; +Cc: Dave Cundiff, linux-raid

Hi all,

No, sorry, I haven't curled up and died yet, and I am still working
through this. Things have somewhat calmed down, and I've tried not to
break anything more than it already is, as well as trying to catch up on
sleep.

So, I'm going to run through a quick summary of what has happened to
date, and at the end recap what I'm going to try and achieve this
weekend. Finally, I hope by the end of the weekend, it will run like a
dream :)

So, from the beginning (skip to the end if you remember this/get bored)

I had a SAN (called san1) server which was Debian Stable, with 5 x 480GB
Intel 520s MLC SSD's in a Linux md raid5 array.
On top of the RAID array is DRBD, (for the purposes of the rest of this
project/discussion, it is disconnected from the secondary).
On top of DRBD is LVM2, which divides up the space for each VM
On top of this is iet (iSCSI) to export each LV individually
The server had 4 x 1Gbps ethernet connected in round-robin to the
switch, plus 1 x 1Gbps ethernet for "management" and 1 x 1Gbps ethernet
crossover connection to the secondary DRBD which is in
disconnected/offline mode.

There are 8 Xen servers running Debian Testing, with a single 1Gbps
ethernet connection each, connected to the same switch as above.

Each xen server runs open-iscsi and logs into all available iSCSI
'shares'. This then appears as /dev/sdX which is passed as to the MS
Windows VM running on it (and it has the GPLPV drivers installed).

I was using the deadline scheduler, and it was advised to try changing
to noop, and disable NCQ (ie putting the driver in native IDE mode or
setting queue depth to 1).

I tried the noop in combination with stupidly:
echo 1 > /sys/block/sdb/queue/nr_requests
Which predictably resulted in poor performance. I reversed both
settings, and continued with the deadline scheduler.

At one stage I was asked to collect stats on the switch ports. I've now
done this (using mrtg with rrd, polling at 5 minute intervals) for
both the 16 port switch with the user traffic and the 48 port switch
with the iSCSI traffic. This shows that at times, I can see the high
traffic on the Windows DC User LAN, and at the same time on the iSCSI
LAN ports for that xen box, and also a pair of LAN ports for the SAN1
box. However, what is interesting is
a) From about 9am to 5pm (minus a dip at lunch time) there is a
consistent 5Mbps to 10Mbps traffic on the user LAN port. This contrasts
with after hours backup traffic peaking at 15Mbps (the backup uses rsync).
b) During 9am to 5pm, the pair of iSCSI LAN ports are not very busy,
sitting around 5 to 10Mbps each.
c) Tonight the backup started at 8pm, but from almost exactly 6pm, the
user LAN port was mostly idle, while the iSCSI SAN ports both were
running at 80 to over 100Mbps each.

(Remember these are 5 minute averages though...)

When checking the switch stats, I found no jumbo frames were in use.
Since then, the iSCSI LAN is fully jumbo frames enabled, and I do see
plenty of jumbo frames on the ports. The other switch with the user LAN
traffic does not have jumbo frames enabled; there are lots of machines
on the LAN which do not support jumbo frames, including switches limited
to 10/100Mbps...
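
On the Linux side each iSCSI-facing port simply gets an MTU of 9000,
either on the fly or via an "mtu 9000" line in /etc/network/interfaces
(the interface name here is just an example):

~$ ip link set dev eth2 mtu 9000

The iSCSI switch ports are set to allow jumbo frames as well.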

I was seeing a lot of Pause frames on the SAN ports and the windows DC
port.

I was getting delayed write errors from windows. I made the following
changes to resolve this:
a) Disable write cache on all windows machines, on all drives.
(including the Windows DC and Terminal Servers)
b) Installed multipath on the xen boxes, and configured it to queue on
iSCSI failure; this should cause a stall rather than a write failure
(rough config sketch below).
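
The multipath.conf on the xen boxes ended up roughly like this (the
vendor/product strings match the iet-exported LUNs; treat the details
as illustrative rather than gospel):

defaults {
        user_friendly_names yes
}

devices {
        device {
                vendor                 "IET"
                product                "VIRTUAL-DISK"
                path_grouping_policy   multibus
                features               "1 queue_if_no_path"
        }
}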

I went backwards and forwards, and learned a lot about network
architecture, 802.3ad, LACP, bonding types, etc. Eventually, removed all
802.3ad configurations, removed roundrobin, and used balance-alb with
MPIO (to actually get more sessions to be able to scale up past a single
port). This isn't the final destination, but the networking side of
things now seems to be working really well/good enough.

One important point to note is that 802.3ad or LACP on the switch side
meant inbound traffic all used the same link. In addition, Linux didn't
seem to balance outbound traffic well (it uses the mac address, or it
uses the IP address + port to decide which outbound port to use). In one
scenario, 1 of the 4 ports was unused, 1 was dedicated for 1 machine
each, one shared for 2 machines, and one shared for 5 machines. Very
poor balancing. Using balance-alb works MUCH better for traffic in both
directions to be much better balanced.
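
For anyone following along, the bonding setup on san1 is basically the
stock Debian ifenslave config; roughly this in /etc/network/interfaces
(the address and interface names are illustrative):

auto bond0
iface bond0 inet static
        address 10.0.0.1
        netmask 255.255.255.0
        bond-mode balance-alb
        bond-miimon 100
        bond-slaves eth2 eth3 eth4 eth5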

Even without any config, installing linux multipath and accessing the
same /dev/sdX device showed that Linux would now cache reads for iSCSI.
I did this, but I don't think it made much user level difference.

Have re-aligned my partitions on the 5 x SSD's to align optimally. This
didn't have much impact on performance anyway, but it was one thing to
tick off the list.

I was asked to supply photos of the mess of cabling, since I've now got
3 x 1Gbps ethernet for each of the 8 xen machines, plus 10 x 1Gbps
ethernet for each of the 2 SAN servers. That is a total of 48 cables
just for this set of 10 servers.... I did all cabling using "spare"
cables initially because I forgot I'd be needing a bunch of extra
cables. Once I ordered all new cables, I re-did it all, and also used
plenty of cable ties.
URL to photos will be sent to those who want to see them (off list....).
I'm pretty proud of my effort compared to the first attempt, but I'm
open to comments/suggestions on better cabling methods etc. I've used
Yellow cables to signify the iSCSI network, and blue for the "user"
network. Since they already used blue cables for the user networking
anyway....

Found a limitation in Linux where I couldn't log in to more than 32
sessions within a short period of time. So using MPIO to log in to 11
LUN's with 4 paths didn't work (44 logins at same time). Limited this to
2 paths, and this works properly.

Upgraded Linux kernel on the SAN1 machine to debian backports (3.2.x) to
bypass the REALLY bad performance for SSD's with the bug in 2.6.26
including the Debian stable version. The new kernel still doesn't solve
the 32 session iSCSI login limit.

Installed irqbalance to assist in balancing IRQ workload across all
available cores on SAN1

After all the above, complaints have fallen off, and are now generally
limited. I do still rarely see high IO load on the DC, and get a dozen
or so complaints from users. eg, there was very high load on the DC from
approx 3:45 to 4:10pm and at the same time I got a bunch of complaints.
I still get a few complaints about slowness and stalling, but these are
much less frequent, though enough to be unsettling. I still think there is
some issue, since even these "high loads" are not anywhere near
the capacity of the system. eg, 20MB/s is about 20 to 25% of what the
maximum capacity should be.




THINGS STILL TO TRY/DO
Could you please feel free to re-arrange the order of these, or let me
know if I should skip/not bother any of them. I'll try to do as much as
possible this weekend, and then see what happens next week.

1) Make sure stripe_cache_size is as least 8192.  If not:
~$ echo 8192 > /sys/block/md0/md/stripe_cache_size
Currently using default 256.

2) Disable HT on the SAN1, retest write performance for single threaded
write issue.
top -b -n 60 -d 0.25|grep Cpu|sort -n > /some.dir/some.file

3) fio tests should use this test config:
[global]
filename=/dev/vg0/testlv (assuming this is still correct)
zero_buffers
numjobs=16
thread
group_reporting
blocksize=256k
ioengine=libaio
iodepth=16
direct=1
size=8g

[read]
rw=randread
stonewall

[write]
rw=randwrite
stonewall

4) Try to connect the SSD's direct to the HBA, bypassing the hotswap
device in case this is limiting to SATA II or similar.

5) Configure the user LAN switch to prioritise RDP traffic. If SMB
traffic is flooding the link, than we need the user to at least feel
happy that the screen is still updating.

6) SAN1 - Get rid of the bond0 with 8 x 1G ports, and use 8 IP
addresses, (one on each port). Properly configure the clients to each
connect to a different pair of ports using MPIO.

7) Upgrade DRBD to 8.4.3
See https://blogs.linbit.com/p/469/843-random-writes-faster/

8) Lie to DRBD, pretend we have a BBU

9) Check out the output of xm top
I presume this is to ensure the dom0 CPU is not too busy to keep up with
handling the iSCSI/ethernet traffic/etc.

10) Run benchmarks on a couple of LV's on the san1 machine, if these
pass the expected performance level, then re-run on the physical
machines (xen). If that passes, then run inside a VM.

11) Collect the output from iostat -x 5 when the problem happens

12) disable NCQ (ie putting the driver in native IDE mode or setting
queue depth to 1).

I still haven't worked out how to actually do this, but now I'm using
the LSI card, maybe it is easier/harder, and apparently it shouldn't
make a lot of difference anyway.

13) Add at least a second virtual CPU (plus physical cpu) to the windows
DC. It is still single CPU due to the windows HAL version. Prefer to
provide a total of 4 CPU's to the VM, leaving 2 for the physical box,
same as all the rest of the VM's and physicals.

14) Upgrade windows 2000 DC to windows 2003, potentially there was some
xen/windows issue with performance. Previously I had an issue with
Win2003 with no service packs, and it was resolved by upgrade to service
pack 4.

15) "Make sure all LVs are aligned to the underlying md device geometry.
 This will eliminate any possible alignment issues."
What does this mean? The drive partitions are now aligned properly, but
how does LVM allocate the blocks for each LV, and how do I ensure it
does so optimally? How do I even check this?

16) RAID5:
md1 : active raid5 sdb1[7] sdd1[9] sde1[5] sdc1[8] sda1[6]
      1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
[UUUUU]
      bitmap: 2/4 pages [8KB], 65536KB chunk
Is it worth reducing the chunk size from 64k down to 16k or even smaller?

17) Consider upgrading the dual port network card on the DC box to a
4port card, use 2 ports for iSCSI and 2 ports for the user lan.
Configure the user lan side as LACP, so it can provide up to 1G for each
of 2 SMB users simultaneously. Means total 2Gbps for iSCSI and total
2Gbps for SMB, but only 1Gbps SMB for each user.

18) Ability to request the SSD to do garbage collection/TRIM/etc at
night (off peak)

19) Check IO size, seems to prefer doing a lot of small IO instead of
big blocks. Maybe due to drbd.

Thanks again to everyone's input/suggestions.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-02-23 15:57               ` RAID performance - new kernel results John Stoffel
@ 2013-03-01 16:10                 ` Adam Goryachev
  2013-03-10 15:35                   ` Charles Polisher
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-03-01 16:10 UTC (permalink / raw)
  To: John Stoffel; +Cc: Dave Cundiff, linux-raid

On 24/02/13 02:57, John Stoffel wrote:
> 
> Adam,
> 
> Can I please ask you to sit down and write a paper for USENIX on this
> whole issue and how you resolved it?  You and Stan have done a great
> job here documenting and discussing the problems, troubleshooting
> methods and eventual solution(s) to the problem.  
> 
> It would be wonderful to have some diagrams to go with all this
> discussion, showing the original network setup, iSCSI disk setup,
> etc.  Then how to updated and changed thing to find bottlenecks. 
> 
> The interesting thing is the complete slowdown when using LVM
> snapshots, which points to major possibilities for performance
> improvements there.  But those improvements will be hard to do without
> being able to run on real hardware, which is expensive for people to
> have at home.  
> 
> I've been following this discussion from day one and really enjoying
> it and I've learned quite a bit about iSCSI, networking and some of
> the RAID issues.  I too run Debian stable on my home NFS/VM/mail/mysql
> server and I've been getting frustrated by how far back it is, even
> with backports.  I got burned in the past by testing, which is why I
> stay on stable, but now I'm feeling like I'm getting burned on stable
> too.  *grin*  It's a balancing act for sure!

I've never written anything like that, but I think I could write a book
on this. I keep thinking I should get a blog and put stuff like this on
there, but there is always something else to do, and I'm not the sort of
person to write in my diary every day :)

I've already written up a sort of non-technical summary for the client
(about 5 pages), and just sent a non-detailed technical summary to the
list. Once everything is completed and settled, I can try and combine
those two, maybe throw in a bunch of extra details (command lines,
config files, etc), and see where it ends up. I suppose you are
volunteering as editor <G>

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-01 16:06                           ` Adam Goryachev
@ 2013-03-02  9:15                             ` Stan Hoeppner
  2013-03-02 17:07                               ` Phil Turmel
  2013-03-03 17:32                               ` Adam Goryachev
  0 siblings, 2 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-02  9:15 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Linux RAID

First reply missed the list.

On 3/1/2013 10:06 AM, Adam Goryachev wrote:
> Hi all,

Hi Adam,

This is really long so I'll hit the important parts and try to be brief.

> THINGS STILL TO TRY/DO
> Could you please feel free to re-arrange the order of these, or let me
> know if I should skip/not bother any of them. I'll try to do as much as
> possible this weekend, and then see what happens next week.
> 
> 1) Make sure stripe_cache_size is as least 8192.  If not:
> ~$ echo 8192 > /sys/block/md0/md/stripe_cache_size
> Currently using default 256.

Critical: a low value here may be severely limiting SSD write
throughput.  And I suspect this default of 256 is more than a minor
factor in your low FIO write performance.

> 2) Disable HT on the SAN1, retest write performance for single threaded
> write issue.
> top -b -n 60 -d 0.25|grep Cpu|sort -n > /some.dir/some.file
> 
> 3) fio tests should use this test config:
> [global]
> filename=/dev/vg0/testlv (assuming this is still correct)
> zero_buffers
> numjobs=16
> thread
> group_reporting
> blocksize=256k
> ioengine=libaio
> iodepth=16
> direct=1
> size=8g
> 
> [read]
> rw=randread
> stonewall
> 
> [write]
> rw=randwrite
> stonewall

This test should provide a bit more realistic picture of your current
write throughput capability.  "zero_buffers" causes FIO to fill its
buffers with zeroes instead of the default random pattern.  The Intel
520 480 SSDs have the Sandforce 2281 controller which performs on the
fly compression, to both increase performance and increase effective
capacity.  Most user data is compressible.  This should show an increase
in throughput over previous tests.

Second, this test uses 16 write threads instead of one, which will make
sure we're filling the queue.  All FIO testing you've done has been
single threaded with AIO, which may or may not have been filling the queue.

Third, this test is fully random IO, which better mimics your real world
workload than your previous testing.  Depending on these Intel SSDs,
this may increase or decrease both the read and/or write throughput
results.  I'd guess you'll see decreased read but increased write.

> 4) Try to connect the SSD's direct to the HBA, bypassing the hotswap
> device in case this is limiting to SATA II or similar.

You don't have to touch the hardware.  Simply do:

~$ dmesg|grep "link up"
ata3: SATA link up 6.0 Gbps (SStatus 113 SControl 310)

This tells you the current data rate of each SAS/SATA link on all
controllers.  With a boot SSD on mobo and 5 on the LSI, you should see 6
at 6.0 Gbps and 1 at 3.0 Gbps.  Maybe another one if you have a DVD on SATA.

> 5) Configure the user LAN switch to prioritise RDP traffic. If SMB
> traffic is flooding the link, than we need the user to at least feel
> happy that the screen is still updating.

Can't hurt but only help.

> 6) SAN1 - Get rid of the bond0 with 8 x 1G ports, and use 8 IP
> addresses, (one on each port). Properly configure the clients to each
> connect to a different pair of ports using MPIO.

The connections are done with iscsiadm.  MPIO simply uses the
resulting two local SCSI devices.  Remember the iscsiadm command line
args to log each Xen client interface (IP) into only one san1 interface
(IP).
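
As a rough sketch, per Xen box and per interface/portal pair it's along
these lines (the IQN, IPs, and iface names below are placeholders):

~$ iscsiadm -m iface -I iface-eth1 --op=new
~$ iscsiadm -m iface -I iface-eth1 --op=update -n iface.net_ifacename -v eth1
~$ iscsiadm -m discovery -t sendtargets -p 10.0.0.1:3260 -I iface-eth1
~$ iscsiadm -m node -T iqn.2013-02.au.example:lun1 -p 10.0.0.1:3260 -I iface-eth1 --login

Repeat with iface-eth2 pointed at the second san1 portal IP, and
multipath will then see the same LUN down both paths.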

> 7) Upgrade DRBD to 8.4.3
> See https://blogs.linbit.com/p/469/843-random-writes-faster/

Looks good.

> 8) Lie to DRBD, pretend we have a BBU

Not a good idea.  Your Intel SSDs are consumer, not enterprise, and thus
don't have the power loss write capacitor.  And you don't have BBU in
the other SAN box.  Thus you have no capability like that of BBU.
Either box could crash, and UPS are not infallible, thus you'd better do
write-through instead of write-back, i.e. don't lie to DRBD.  Any added
performance isn't worth the potential disaster.

> 9) Check out the output of xm top
> I presume this is to ensure the dom0 CPU is not too busy to keep up with
> handling the iSCSI/ethernet traffic/etc.

One of those AMD cores should be plenty for the hypervisor at peak IO
load, as long as no VMs are allowed to run on it.  Giving a 2nd core to
the DC VM may help though.
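
If you want to hard-reserve a vCPU for dom0 itself, the Xen docs
suggest this is done with hypervisor boot parameters, something like
the following in /etc/default/grub on Debian followed by update-grub
(again, I'm not a Xen user, so treat this as a pointer, not a recipe,
and the guests still need their cpus= masks set to avoid that core):

GRUB_CMDLINE_XEN="dom0_max_vcpus=1 dom0_vcpus_pin"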

> 10) Run benchmarks on a couple of LV's on the san1 machine, if these
> pass the expected performance level, then re-run on the physical
> machines (xen). If that passes, then run inside a VM.

For getting at client VM performance, start testing there.  Only if you
can't hit close to 100MB/s should you drill down through the layers.

> 11) Collect the output from iostat -x 5 when the problem happens

Not sure what this was for.  Given the link throughput numbers you
posted the user complaints are not due to slow IO on the SAN server, but
most likely a problem with the number of cores available to each TS VM
on the Xen boxen.

> 12) disable NCQ (ie putting the driver in native IDE mode or setting
> queue depth to 1).
>
> I still haven't worked out how to actually do this, but now I'm using
> the LSI card, maybe it is easier/harder, and apparently it shouldn't
> make a lot of difference anyway.

Yeah, don't bother with this-- would only slightly help, if at all.

> 13) Add at least a second virtual CPU (plus physical cpu) to the windows
> DC. It is still single CPU due to the windows HAL version. Prefer to
> provide a total of 4 CPU's to the VM, leaving 2 for the physical box,
> same as all the rest of the VM's and physicals.

Probably won't help much but can't hurt.  Give it a low to-do priority.

> 14) Upgrade windows 2000 DC to windows 2003, potentially there was some
> xen/windows issue with performance. Previously I had an issue with
> Win2003 with no service packs, and it was resolved by upgrade to service
> pack 4.

Good idea.  W2K was around long before the virtual machine craze.

> 15) "Make sure all LVs are aligned to the underlying md device geometry.
>  This will eliminate any possible alignment issues."
> What does this mean? The drive partitions are now aligned properly, but
> how does LVM allocate the blocks for each LV, and how do I ensure it
> does so optimally? How do I even check this?

I'm not an LVM user so I can't give you command lines.  But what I can
tell you follows, and it is somewhat critical to RMW performance, more
for rust but also for SSD to a lesser degree.

> 16) RAID5:
> md1 : active raid5 sdb1[7] sdd1[9] sde1[5] sdc1[8] sda1[6]
>       1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
> [UUUUU]
>       bitmap: 2/4 pages [8KB], 65536KB chunk

Your md/RAID5 stripe width is 4 x 64KB = 256KB.  Thus every slice you
create for LVM should start on a sector that is a multiple of 256KB,
i.e. a multiple of 512 sectors.  Say your first LVM slice of the md
device is to be 100,000 stripes, roughly 25GB.  It starts at sector 0 of
the md device, so the size of the slice in sectors would be, assuming my
math fu is up to the task:

(262,144 bytes * 100,000 stripes) = 26,214,400,000 bytes / 512 = 51,200,000 sectors

So the slice ends at sector 51,199,999 and your next slice should start
at sector 51,200,000, which is again a multiple of 512.  What this does is
make sure your LVM blocks line up evenly atop the md/RAID stripes.  If
they don't and they lay over two consecutive md stripes you can get
double the RMW penalty.  For a typical single power user PC this isn't a
huge issue due to the massive IOPS of SSDs.  But for a server such as
yours with lots of random user IO and potentially snapshots and DRBD
mirroring, etc, it could cause significant slowdown due to the extra RMW IO.

> Is it worth reducing the chunk size from 64k down to 16k or even smaller?

64KB chunks should be fine here.  Any gains with a smaller chunk would
be small, and would pale in comparison to the amount of PITA required to
redo the array and everything currently sitting atop it.  Remember you'd
have to destroy it and start over.  You can't change chunk size of an
existing array.

> 17) Consider upgrading the dual port network card on the DC box to a
> 4port card, use 2 ports for iSCSI and 2 ports for the user lan.
> Configure the user lan side as LACP, so it can provide up to 1G for each
> of 2 SMB users simultaneously. Means total 2Gbps for iSCSI and total
> 2Gbps for SMB, but only 1Gbps SMB for each user.

Or simply add another single port $15 Realtek 8111/8168 PCIe x1 NIC,
which matches the onboard ethernet, for user traffic--user traffic on
Realtek, iSCSI on Intel.  This will allow the DC box to absorb sporadic
large SMB transfers without slowing all other users' SMB traffic.  Given
the cost per NIC you can easily do this on all Xen boxen so you still
have SC migration ability across all Xen.

> 18) Ability to request the SSD to do garbage collection/TRIM/etc at
> night (off peak)

This isn't possible.  GC is an SSD firmware function.  TRIM can only be
issued by a filesystem driver.  I doubt one will ever be able to pass
TRIM commands down from Windows guest SCSI layer through exported Xen
disks across iSCSI to iscsi-target to md to SSD.  Remember TRIM is a
filesystem function.  In your setup you must simply rely on the SSD
firmware to handle GC and without TRIM.

> 19) Check IO size, seems to prefer doing a lot of small IO instead of
> big blocks. Maybe due to drbd.

DRBD does cause the small IOs.  DRBD simply mirrors changes to the array
device.  Your client applications dictate the size of IOs.

> Thanks again to everyone's input/suggestions.

Any time.  I have one more suggestion that might make a world of
difference to your users.

You did not mention the virtual CPU configuration on the Xen TS boxen.
Are you currently assigning 5 of 6 cores as vCPUs to both Windows TS
instances?  If not you should be.  You should be able to assign a vCPU
or an SMP group to more than one guest, or vice versa.  Doing so will
allow either guest to use all available cores when needed.  If you
dedicate one to the hypervisor that leaves 5.  I'm not sure if Windows
will run with an asymmetric CPU count.  If not, assign cores 1-4 to one
guest and cores 2-5 to the other, assuming core 0 is dedicated to the
hypervisor.  If Xen won't allow this, then create one SMP group of 4
cores and assign it to both guests.  I've never used Xen so my
terminology is likely off.  This is trivial to do with ESX.
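
From a quick skim of the xm man page the knobs appear to be something
like this, though I've not run these myself, so verify against your Xen
version (the domain name is a placeholder):

~$ xm vcpu-set ts-vm1 5            # give the guest 5 vCPUs
~$ xm vcpu-pin ts-vm1 all 1-5      # pin all its vCPUs to physical cores 1-5

or the equivalent vcpus=5 and cpus="1-5" lines in the guest config file.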

If you are currently only assigning 1 or 2 cores to each Windows TS
guest, the additional cores should make a huge difference to your users,
depending on the applications they run.  For example, a user viewing a
large and/or complex PDF in the horribly CPU inefficient Adobe Reader
(or $deity forbid the browser plugin) such as PDFs with embedded
engineering schematics, can easily eat all the cycles of one or even two
cores for 10-15 seconds or more at a time, multiple times while paging
through the file.

A perfect example: using Adobe Reader (not the plugin) with this
SuperMicro chassis manual:
http://www.supermicro.com/manuals/chassis/tower/SC417.pdf

eats 100% of one of my two 3GHz AMD cores for about 2-5 seconds each
time it renders one of the vector graphics chassis schematic pages.
With some of the schematics it eats all of BOTH cores for about 3-5
seconds as recent versions of Acrobat do threaded processing of vector
graphics.  This is with Acrobat 10.1.6 (latest) on WinXP, 3GHz AthlonII
x2, dual channel DDR3-1333, PCIe x16 nVidia GT240, Corsair SSD--not a
slow box.  Rendering something like this on Terminal Services would
likely increase CPU burn time and rendering times many fold over that of
my workstation.

If you have a TS user doing something like this with only 1-2 cores per
TS VM it will bring everyone to their knees for many seconds, possibly
minutes, at a time.  And this isn't limited to Adobe reader.  There are
many browser plugin apps that will do the same, or worse.  Flash comes
to mind.  I've come across some poorly written Flash web sites that will eat
all of a CPU like this just idling on the index page.  Watching a Flash
movie trailer at 1080 or 720 HD will bring your TS to its knees as well.
 These are but two application examples that will bring a TS to its
knees.  If you already have cores for TS VMs covered my apologies for
the extra reading.  Maybe it will be helpful to others.

-- 
Stan




^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-02  9:15                             ` Stan Hoeppner
@ 2013-03-02 17:07                               ` Phil Turmel
  2013-03-02 23:48                                 ` Stan Hoeppner
  2013-03-03 15:19                                 ` Adam Goryachev
  2013-03-03 17:32                               ` Adam Goryachev
  1 sibling, 2 replies; 131+ messages in thread
From: Phil Turmel @ 2013-03-02 17:07 UTC (permalink / raw)
  To: stan; +Cc: Adam Goryachev, Linux RAID

On 03/02/2013 04:15 AM, Stan Hoeppner wrote:
> On 3/1/2013 10:06 AM, Adam Goryachev wrote:

>> 15) "Make sure all LVs are aligned to the underlying md device geometry.
>>  This will eliminate any possible alignment issues."
>> What does this mean? The drive partitions are now aligned properly, but
>> how does LVM allocate the blocks for each LV, and how do I ensure it
>> does so optimally? How do I even check this?
> 
> I'm not an LVM user so I can't give you command lines.  But what I can
> tell you follows, and it is somewhat critical to RMW performance, more
> for rust but also for SSD to a lesser degree.

Run "dmsetup table" and look at the start sectors for your volumes:

> Fast-Root: 0 314572800 linear 9:3 3072

This volume starts at sector 3072 (1.5MB) on /dev/sda3.  So the volume
alignment within LVM is 512K.
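
To sanity check against Adam's 256KB (512-sector) stripe width,
something like this should do (for linear targets the last field of
dmsetup table is the start offset on the underlying device):

~$ dmsetup table | awk '$4 == "linear" { print $1, $NF, ($NF % 512 ? "misaligned" : "aligned") }'

And if the PV were ever rebuilt from scratch, the alignment can be set
explicitly at creation time; the values here match the 64K chunk x 4
data disks, so adjust if the array changes:

~$ pvcreate --dataalignment 256k /dev/md1
~$ vgcreate -s 4m vg0 /dev/md1

The default 4MB extent size is already a multiple of 256KB, so LVs
allocated on extent boundaries stay aligned as long as the PV data area
start is aligned.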

>> Is it worth reducing the chunk size from 64k down to 16k or even smaller?
> 
> 64KB chunks should be fine here.  Any gains with a smaller chunk would
> be small, and would pale in comparison to the amount of PITA required to
> redo the array and everything currently sitting atop it.  Remember you'd
> have to destroy it and start over.  You can't change chunk size of an
> existing array.

Actually, you can.  For levels 0 and 4,5,6.
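
For the record, with mdadm >= 3.1 and a 2.6.31 or later kernel it goes
roughly like this (the backup file location is just an example, and
expect the reshape to take a while):

~$ mdadm --grow /dev/md1 --chunk=16 --backup-file=/root/md1-reshape.bak
~$ cat /proc/mdstat        # watch the reshape progress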

HTH,

Phil

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-02 17:07                               ` Phil Turmel
@ 2013-03-02 23:48                                 ` Stan Hoeppner
  2013-03-03  2:35                                   ` Phil Turmel
  2013-03-03 15:19                                 ` Adam Goryachev
  1 sibling, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-02 23:48 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Adam Goryachev, Linux RAID

On 3/2/2013 11:07 AM, Phil Turmel wrote:
> On 03/02/2013 04:15 AM, Stan Hoeppner wrote:
>> On 3/1/2013 10:06 AM, Adam Goryachev wrote:
...
>>> Is it worth reducing the chunk size from 64k down to 16k or even smaller?
>>
>> 64KB chunks should be fine here.  Any gains with a smaller chunk would
>> be small, and would pale in comparison to the amount of PITA required to
>> redo the array and everything currently sitting atop it.  Remember you'd
>> have to destroy it and start over.  You can't change chunk size of an
>> existing array.
> 
> Actually, you can.  For levels 0 and 4,5,6.

First, I'll reiterate that a smaller chunk size likely is not going to
yield real workload gains for Adam.  And it obviously would decrease his
FIO numbers, making him think performance decreased, even if it actually
increased slightly with his real workload.

Speaking strictly now from a knowledge transfer standpoint, does this
chunk size change feature go back a ways or does it require a fairly
recent kernel and/or mdadm?  Are there any prerequisites or special
considerations different from any other reshape operation?

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-02 23:48                                 ` Stan Hoeppner
@ 2013-03-03  2:35                                   ` Phil Turmel
  0 siblings, 0 replies; 131+ messages in thread
From: Phil Turmel @ 2013-03-03  2:35 UTC (permalink / raw)
  To: stan; +Cc: Adam Goryachev, Linux RAID

On 03/02/2013 06:48 PM, Stan Hoeppner wrote:
> On 3/2/2013 11:07 AM, Phil Turmel wrote:
>> On 03/02/2013 04:15 AM, Stan Hoeppner wrote:
>>> On 3/1/2013 10:06 AM, Adam Goryachev wrote:
> ...
>>>> Is it worth reducing the chunk size from 64k down to 16k or even smaller?
>>>
>>> 64KB chunks should be fine here.  Any gains with a smaller chunk would
>>> be small, and would pale in comparison to the amount of PITA required to
>>> redo the array and everything currently sitting atop it.  Remember you'd
>>> have to destroy it and start over.  You can't change chunk size of an
>>> existing array.
>>
>> Actually, you can.  For levels 0 and 4,5,6.
> 
> First, I'll reiterate that a smaller chunk size likely is not going to
> yield real workload gains for Adam.  And it obviously would decrease his
> FIO numbers, making him think performance decreased, even if it actually
> increased slightly with his real workload.

No contest.  I have effectively no experience at these hardware
performance levels.

> Speaking strictly now from a knowledge transfer standpoint, does this
> chunk size change feature go back a ways or does it require a fairly
> recent kernel and/or mdadm?  Are there any prerequisites or special
> considerations different from any other reshape operation?

From the announcement for mdadm version 3.1, October 2009:

> It contains significant feature enhancements over 3.0.x
> 
> The brief change log is:
>    -    Support --grow to change the layout of RAID4/5/6
>    -    Support --grow to change the chunksize of raid 4/5/6
>    -    Support --grow to change level from RAID1 -> RAID5 -> RAID6 and
>         back.
>    -    Support --grow to reduce the number of devices in RAID4/5/6.
>    -    Support restart of these grow options which assembling an array 
> 	which is partially grown.
>    -    Assorted tests of this code, and of different RAID6 layouts.
> 
> Note that a 2.6.31 or later is needed to have access to these.
> Reducing devices in a RAID4/5/6 requires 2.6.32.
> Changing RAID5 to RAID1 requires 2.6.33.

So I'm sure there are plenty of older systems that can't do this, but
current distros should all have it.
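
Off the top of my head (so double-check against your mdadm man page
before trying it), the invocation looks something like:

  ~$ mdadm --grow /dev/md1 --chunk=16 --backup-file=/root/md1-chunk.bak

with the backup file on a device outside the array.  The reshape runs
in the background and progress shows up in /proc/mdstat.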

Phil

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-02 17:07                               ` Phil Turmel
  2013-03-02 23:48                                 ` Stan Hoeppner
@ 2013-03-03 15:19                                 ` Adam Goryachev
  2013-03-04  1:31                                   ` Phil Turmel
  2013-03-04  5:25                                   ` Stan Hoeppner
  1 sibling, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-03-03 15:19 UTC (permalink / raw)
  To: Phil Turmel, stan; +Cc: Linux RAID

Phil Turmel <philip@turmel.org> wrote:

>On 03/02/2013 04:15 AM, Stan Hoeppner wrote:
>> On 3/1/2013 10:06 AM, Adam Goryachev wrote:
>
>>> 15) "Make sure all LVs are aligned to the underlying md device
>>> geometry. This will eliminate any possible alignment issues."
>>> What does this mean? The drive partitions are now aligned properly,
>>> but how does LVM allocate the blocks for each LV, and how do I
>>> ensure it does so optimally? How do I even check this?
>> 
>> I'm not an LVM user so I can't give you command lines.  But what I
>> can tell you follows, and it is somewhat critical to RMW performance,
>> more for rust but also for SSD to a lesser degree.
>Run "dmsetup table" and look at the start sectors for your volumes:
>Fast-Root: 0 314572800 linear 9:3 3072
>This volume starts at sector 3072 (1.5MB) on /dev/sda3.  So the volume
>alignment within LVM is 512K.

I see this (for the first LV)
vg0-hostname: 0 204808192 linear 147:2 512
So, I'm guessing mine is starting at 512, and is also aligned at 512k ?

Also, pvdisplay tells me the PE Size is 4M, so I'm assuming that regardless of how the LV's are arranged, they will always be 512k aligned?

So, is that enough to be sure that this is not an issue?

Thanks,
Adam



--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-02  9:15                             ` Stan Hoeppner
  2013-03-02 17:07                               ` Phil Turmel
@ 2013-03-03 17:32                               ` Adam Goryachev
  2013-03-04 12:20                                 ` Stan Hoeppner
  1 sibling, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-03-03 17:32 UTC (permalink / raw)
  To: stan; +Cc: Linux RAID

Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> 1) Make sure stripe_cache_size is as least 8192.  If not:
>> ~$ echo 8192 > /sys/block/md0/md/stripe_cache_size
>> Currently using default 256.

Done now

>> 2) Disable HT on the SAN1, retest write performance for single
>threaded
>> write issue.
>> top -b -n 60 -d 0.25|grep Cpu|sort -n > /some.dir/some.file

Done now

There seems to be only one row from the top output which is interesting:
Cpu0 : 3.6%us, 71.4%sy, 0.0%ni, 0.0%id, 10.7%wa, 0.0%hi, 14.3%si, 0.0%,st

Every other line had a high value for %id, or %wa, with lower values for %sy. This was during the second 'stage' of the fio run, earlier in the fio run there was no entry even close to showing the CPU as busy.

>> 3) fio tests should use this test config:

Sorry about the brief output, but I'm retyping it manually :(
READ: io=131072MB, aggrb=2506MB/s, minb=2566MB/s, maxb=2566MB/s, mint=52303msec, maxt=52303msec
WRITE: io=131072MB, aggrb=1262MB/s, minb=1292MB/s, maxb=1292MB/s, mint=103882msec, maxt=103882msec

Is this what I should be expecting now?
To me, it looks like it is close enough, but if you think I should be able to get even faster, then I will certainly investigate further.

>> 4) Try to connect the SSD's direct to the HBA, bypassing the hotswap
>> device in case this is limiting to SATA II or similar.
>You don't have to touch the hardware.  Simply do:
>~$ dmesg|grep "link up"
>ata3: SATA link up 6.0 Gbps (SStatus 113 SControl 310)
>
>This tells you the current data rate of each SAS/SATA link on all
>controllers.  With a boot SSD on mobo and 5 on the LSI, you should see
>6 at 6.0 Gbps and 1 at 3.0 Gbps.  Maybe another one if you have a DVD on
>SATA.

Actually, the motherboard has 2 x SATA3 ports and 4 x SATA2 ports, so I see one entry for the motherboard attached SSD at 6.0Gbps

However, all the drives connected to the LSI do not show this line.
I see the following entry though:
ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
Perhaps that is related to the LSI ?
I also see lines like this:
scsi 0:0:0:0 SATA: handle(0x0009), sas_addr(0x4433221103000000), phy(3), device_name(0x0000000000000000)
scsi 0:0:0:0 SATA: enclosure_logical_id(0x500605b005b6b920), slot(0)

I had a poke around in /sys and the rest of the dmesg output, but couldn't see anything to clearly identify the speed they were connected at. My supplier suggests that the hot swap bay is just a physical passthrough with no "smarts" of its own, so it should not "reduce" the links down to SATA2 etc... I'll leave this for now, but I would still like to be sure that the drives are working properly at SATA3 speeds...
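
Next time I'm in front of it I might try something like this (untested
guesses on my part, so the exact paths may need adjusting):

  ~$ grep . /sys/class/sas_phy/*/negotiated_linkrate
  ~$ smartctl -i /dev/sdb | grep -i 'sata version'

in the hope that one of them reports the negotiated link rate per drive
behind the LSI.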

>> 5) Configure the user LAN switch to prioritise RDP traffic. If SMB
>> traffic is flooding the link, than we need the user to at least feel
>> happy that the screen is still updating.
>Can't hurt but only help.

Looked at this, but it gave me a headache... I'll have to come back to it, maybe discuss on the netgear forums to ensure I understand how all the different pieces fit together before I try and implement it.

>> 6) SAN1 - Get rid of the bond0 with 8 x 1G ports, and use 8 IP
>> addresses, (one on each port). Properly configure the clients to each
>> connect to a different pair of ports using MPIO.
>
>The connections are done with the iscsiadmin.  MPIO simply uses the
>resulting two local SCSI devices.  Remember the iscsiadm command line
>args to log each Xen client interface (IP) into only one san1 interface
>(IP).

Skipped, running out of time.... 

>> 7) Upgrade DRBD to 8.4.3
>> See https://blogs.linbit.com/p/469/843-random-writes-faster/
> Looks good

I can't seem to checkout 8.4.3, only 8.4.2 seems to work. I think I would be better to wait on this one as I don't really like to be on the bleeding edge. If everything else doesn't improve things, then I'll come back to this one.

>> 9) Check out the output of xm top
>> I presume this is to ensure the dom0 CPU is not too busy to keep up
>> with handling the iSCSI/ethernet traffic/etc.
>One of those AMD cores should be plenty for the hypervisor at peak IO
>load, as long as no VMs are allowed to run on it.  Giving a 2nd core to
>the DC VM may help though.
>
>> 10) Run benchmarks on a couple of LV's on the san1 machine, if these
>> pass the expected performance level, then re-run on the physical
>> machines (xen). If that passes, then run inside a VM.
>
>For getting at client VM performance, start testing there.  Only if you
>can't hit close to 100MB/s, then drill down through the layers.

I'll have to find some software to run benchmarks within the windows VM's, but at the moment, using fio with the above config (changing the size from 8g down to 2g), and running it on the xen physical machine (dom0) gives this result (only machine running):
READ: io=32768MB, aggrb=237288KB/s, minb=237288KB/s, maxb=237288KB/s, mint=141408msec, maxt=141408msec
WRITE: io=32768MB, aggrb=203307KB/s, minb=203307KB/s, maxb=203307KB/s, mint=165043msec, maxt=165043msec

So, 230MB/s read and 200MB/s write, using 2 x 1Gbps ethernet seems pretty good. 
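
(As a sanity check: 2 x 1Gbps is 2 x 125MB/s raw, so after
ethernet/TCP/iSCSI overheads roughly 230-240MB/s is about the ceiling,
which means the read figure is essentially wire speed.)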

>> 11) Collect the output from iostat -x 5 when the problem happens
>Not sure what this was for.  Given the link throughput numbers you
>posted the user complaints are not due to slow IO on the SAN server,
>but most likely a problem with the number of cores available to each 
>TS VM on the Xen boxen.

The TS VM's have dedicated 4 cores, and the physical machines have dedicated 2 cores. I don't think the TS are CPU limited; this was for me to watch the DC to ensure its single CPU was not limiting performance

>> 13) Add at least a second virtual CPU (plus physical cpu) to the
>> windows DC. It is still single CPU due to the windows HAL version.
>> Prefer to provide a total of 4 CPU's to the VM, leaving 2 for the 
>> physical box, same as all the rest of the VM's and physicals.
>Probably won't help much but can't hurt.  Give it a low to-do priority.

Will do next time....

>> 14) Upgrade windows 2000 DC to windows 2003, potentially there was
>> some xen/windows issue with performance. Previously I had an issue
>> with Win2003 with no service packs, and it was resolved by upgrade to
>> service pack 4.
>Good idea.  W2K was around long before the virtual machine craze.

Running out of time, will try this another time. Can do it remotely anyway...

>> 15) "Make sure all LVs are aligned to the underlying md device

Should be done

>> 17) Consider upgrading the dual port network card on the DC box to a
>> 4port card, use 2 ports for iSCSI and 2 ports for the user lan.
>> Configure the user lan side as LACP, so it can provide up to 1G for
>> each of 2 SMB users simultaneously. Means total 2Gbps for iSCSI
>> and total 2Gbps for SMB, but only 1Gbps SMB for each user.
>Or simply add another single port $15 Realtek 8111/8168 PCIe x1 NIC,
>which matches the onboard ethernet, for user traffic--user traffic on
>Realtek, iSCSI on Intel.  This will allow the DC box to absorb sporadic
>large SMB transfers without slowing all other users' SMB traffic. 
>Given the cost per NIC you can easily do this on all Xen boxen so you still
>have SC migration ability across all Xen.

Good point, that is probably a smarter idea overall. However, I'll leave this until I run out of other options. 100MB/s *SHOULD* be enough considering it only had 100MB/s previously anyway.

>Any time.  I have one more suggestion that might make a world of
>difference to your users.
>
>You did not mention the virtual CPU configuration on the Xen TS boxen.
>Are you currently assigning 5 of 6 cores as vCPUs to both Windows TS
>instances?  If not you should be.  You should be able to assign a vCPU
>or an SMP group, to more than one guest, or vice versa.  Doing so will
>allow either guest to use all available cores when needed.  If you
>dedicate one to the hypervisor that leaves 5.  I'm not sure if Windows
>will run with an asymmetric CPU count.  If not, assign cores 1-4 to one
>guest and cores 2-5 to the other, assuming core 0 is dedicated to the
>hypervisor.  If Xen won't allow this, the create one SMP group of 4
>cores and assign it to both guests.  I've never used Xen so my
>terminology is likely off.  This is trivial to do with ESX.

I only run a single windows VM on each xen box (unless the xen boxes started failing, then I'd run multiple VM's on one xen box until it was repaired). As such, I have 2 CPU cores dedicated to the dom0 (xen, physical box) and 4 cores dedicated to the windows VM (domU). I can drop the physical box back to a single core and add an additional core to windows, but I don't think that will make much of a difference. I don't tend to see high CPU load on the TS boxes... Mostly I see high memory utilisation more than CPU, and that is one of the reasons to upgrade them to Win2008 64bit so I can allocate more RAM. I'm hoping that even with the virtualisation overhead, the modern CPU's are faster than the previous physical machines which were about 5 years old.

So, overall, I haven't achieved anywhere near as much as I had hoped... I've changed the stripe_cache_size, and disabled HT on san1.

Seems to be faster than before, so will see how it goes today/this week.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-03 15:19                                 ` Adam Goryachev
@ 2013-03-04  1:31                                   ` Phil Turmel
  2013-03-04  9:39                                     ` Adam Goryachev
  2013-03-04  5:25                                   ` Stan Hoeppner
  1 sibling, 1 reply; 131+ messages in thread
From: Phil Turmel @ 2013-03-04  1:31 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: stan, Linux RAID

On 03/03/2013 10:19 AM, Adam Goryachev wrote:
> Phil Turmel <philip@turmel.org> wrote:
> 
>> On 03/02/2013 04:15 AM, Stan Hoeppner wrote:
>>> On 3/1/2013 10:06 AM, Adam Goryachev wrote:
>>
>>>> 15) "Make sure all LVs are aligned to the underlying md device
>>>> geometry. This will eliminate any possible alignment issues."
>>>> What does this mean? The drive partitions are now aligned properly,
>>>> but how does LVM allocate the blocks for each LV, and how do I
>>>> ensure it does so optimally? How do I even check this?
>>>
>>> I'm not an LVM user so I can't give you command lines.  But what I
>>> can tell you follows, and it is somewhat critical to RMW performance,
>>> more for rust but also for SSD to a lesser degree.
>> Run "dmsetup table" and look at the start sectors for your volumes:
>> Fast-Root: 0 314572800 linear 9:3 3072
>> This volume starts at sector 3072 (1.5MB) on /dev/sda3.  So the volume
>> alignment within LVM is 512K.
> 
> I see this (for the first LV)
> vg0-hostname: 0 204808192 linear 147:2 512
> So, I'm guessing mine is starting at 512, and is also aligned at 512k ?

Not quite.  Those size and offset numbers are shown in *sectors*, so you
have 256k alignment.  But that's exactly the minimum you need to match
your raid5 chunk size, so you are good for the moment.

> Also, pvdisplay tells me the PE Size is 4M, so I'm assuming that regardless of how the LV's are arranged, they will always be 512k aligned?

256k, but yeah.

> So, is that enough to be sure that this is not an issue?

It looks to me like you are good on alignment.

Phil

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-03 15:19                                 ` Adam Goryachev
  2013-03-04  1:31                                   ` Phil Turmel
@ 2013-03-04  5:25                                   ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-04  5:25 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Phil Turmel, Linux RAID

On 3/3/2013 9:19 AM, Adam Goryachev wrote:
> Phil Turmel <philip@turmel.org> wrote:
...
>> Run "dmsetup table" and look at the start sectors for your volumes:
>> Fast-Root: 0 314572800 linear 9:3 3072
>> This volume starts at sector 3072 (1.5MB) on /dev/sda3.  So the volume
>> alignment within LVM is 512K.
> 
> I see this (for the first LV)
> vg0-hostname: 0 204808192 linear 147:2 512
> So, I'm guessing mine is starting at 512, and is also aligned at 512k ?
> 
> Also, pvdisplay tells me the PE Size is 4M, so I'm assuming that regardless of how the LV's are arranged, they will always be 512k aligned?
> 
> So, is that enough to be sure that this is not an issue?

If it starts at sector 512, that should be the first sector of the 2nd
md/RAID stripe, so the first sector of the LV is aligned.  204,808,192
sectors divided by a 256KB stripe width equals 400,016 stripes (a whole
number), so this partition is correctly aligned to the RAID device.  If
you have no gaps between this one and your other LVs, and each of them
is evenly divisible by 512 sectors, then they should all be aligned.
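
If you want to check them all at once, something along these lines
should flag any stragglers (untested, and it assumes all your LVs are
plain linear mappings; 512 is your stripe width in sectors):

  ~$ dmsetup table | awk '$4=="linear" {print $1, ($3%512 || $6%512 ? "check" : "ok")}'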

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-04  1:31                                   ` Phil Turmel
@ 2013-03-04  9:39                                     ` Adam Goryachev
  2013-03-04 12:41                                       ` Phil Turmel
  2013-03-04 12:42                                       ` Stan Hoeppner
  0 siblings, 2 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-03-04  9:39 UTC (permalink / raw)
  To: Phil Turmel; +Cc: stan, Linux RAID

On 04/03/13 12:31, Phil Turmel wrote:
> On 03/03/2013 10:19 AM, Adam Goryachev wrote:
>> Phil Turmel <philip@turmel.org> wrote:
>>
>>> On 03/02/2013 04:15 AM, Stan Hoeppner wrote:
>>>> On 3/1/2013 10:06 AM, Adam Goryachev wrote:
>>>
>>>>> 15) "Make sure all LVs are aligned to the underlying md device
>>>>> geometry. This will eliminate any possible alignment issues."
>>>>> What does this mean? The drive partitions are now aligned properly,
>>>>> but how does LVM allocate the blocks for each LV, and how do I
>>>>> ensure it does so optimally? How do I even check this?
>>>>
>>>> I'm not an LVM user so I can't give you command lines.  But what I
>>>> can tell you follows, and it is somewhat critical to RMW performance,
>>>> more for rust but also for SSD to a lesser degree.
>>> Run "dmsetup table" and look at the start sectors for your volumes:
>>> Fast-Root: 0 314572800 linear 9:3 3072
>>> This volume starts at sector 3072 (1.5MB) on /dev/sda3.  So the volume
>>> alignment within LVM is 512K.
>>
>> I see this (for the first LV)
>> vg0-hostname: 0 204808192 linear 147:2 512
>> So, I'm guessing mine is starting at 512, and is also aligned at 512k ?
> 
> Not quite.  Those size and offset numbers are shown in *sectors*, so you
> have 256k alignment.  But that's exactly the minimum you need to match
> your raid5 chunk size, so you are good for the moment.

Probably a silly question, but how do you convert from the information:
Fast-Root: 0 314572800 linear 9:3 3072
vg0-hostname: 0 204808192 linear 147:2 512

My example was 512 sectors, assuming a sector size of 512 bytes, that
provides 256kB (as you advised above).

Your example was 3072 sectors, again assuming a sector size of 512
bytes, that becomes 1.5MB (as you stated above).

So how come my system has an alignment of 256k (1 x offset) while yours
has 512k (1.5M/3) ?

I'm assuming that in any case, as you suggested, the main thing is that
the answer I got (256kB offset) was equal to the MD chunk size (64kB)
multiplied by the number of data drives (4), or 256kB.


>> Also, pvdisplay tells me the PE Size is 4M, so I'm assuming that regardless of how the LV's are arranged, they will always be 512k aligned?
> 
> 256k, but yeah.

So LVM will not allocate any LV a block of space smaller than 4M, and
I'm assuming will always be on a 4M boundary from the beginning of the
device. Since 4MB is a multiple of 256kB, then alignment is OK?

If the MD stripe size was larger, eg, if I added 2 more drives it would
become 64kB chunk x 6 data drives = 384kB. This would mean my LVM is no
longer properly aligned. The first block of 4MB would start at 256kB
which is smaller than the stripe size, and each 4MB block would most
likely not line up since 4MB is not divisible by 384kB?
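
(Doing the sums: 4MB = 4096kB, and 4096 / 384 = 10.67, so each 4MB
extent would cover ten-and-a-bit stripes rather than a whole number of
them.)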

So, if I ever choose to expand the array to include a larger number of
devices (as opposed to replacing all members with larger drives), what
would I need to do to fix all this up?

Re-partition to start the partition at a higher starting sector (such
that 4M / start sector * 512 produces an integer)?

That resolves the first LVM block, but to ensure all other blocks are
properly aligned, is the best answer to upgrade to 8 x data drives
(512kB stripe size)? Or is there some other magic solution I'm missing here?

>> So, is that enough to be sure that this is not an issue?
> It looks to me like you are good on alignment.

Thanks.

On 04/03/13 16:25, Stan Hoeppner wrote:
> If you have no gaps between this one and your other LVs, and each of 
> them is evenly divisible by 512 sectors, then they should all be
> aligned.

Given the 4MB size of the LVM blocks, does that automatically make this
true? I thought it did, but given your above comment, I'm unsure.

Thanks,
Adam


-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-03 17:32                               ` Adam Goryachev
@ 2013-03-04 12:20                                 ` Stan Hoeppner
  2013-03-04 16:26                                   ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-04 12:20 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Linux RAID

On 3/3/2013 11:32 AM, Adam Goryachev wrote:
> Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>> 1) Make sure stripe_cache_size is as least 8192.  If not:
>>> ~$ echo 8192 > /sys/block/md0/md/stripe_cache_size
>>> Currently using default 256.
> 
> Done now

I see below that this paid some dividend.  You could try increasing it
further and may get even better write throughput for this FIO test, but
keep in mind large stripe_cache_size values eat serious amounts of RAM:

Formula:  stripe_cache_size * 4096 bytes * drive_count = RAM usage.  For
your 5 drive array:

 8192 eats 160MB
16384 eats 320MB
32768 eats 640MB

Considering this is an iSCSI block IO server, dedicating 640MB of RAM to
md stripe cache isn't a bad idea at all if it seriously increases write
throughput (and without decreasing read throughput).  You don't need RAM
for buffer cache since you're not doing local file operations.  I'd even
go up to 131072 and eat 2.5GB of RAM if the performance is substantially
better than lower values.

Whatever value you choose, make it permanent by adding this entry to
root's crontab:

@reboot		/bin/echo 32768 > /sys/block/md0/md/stripe_cache_size

>>> top -b -n 60 -d 0.25|grep Cpu|sort -n > /some.dir/some.file
> 
> Done now
> 
> There seems to be only one row from the top output which is interesting:
> Cpu0 : 3.6%us, 71.4%sy, 0.0%ni, 0.0%id, 10.7%wa, 0.0%hi, 14.3%si, 0.0%,st
>
> Every other line had a high value for %id, or %wa, with lower values for %sy. This was during the second 'stage' of the fio run, earlier in the fio run there was no entry even close to showing the CPU as busy.

I expended the time/effort walking you through all of this because I
want to analyze the complete output myself.  Would you please pastebin
it or send me the file?  Thanks.

> READ: io=131072MB, aggrb=2506MB/s, minb=2566MB/s, maxb=2566MB/s, mint=52303msec, maxt=52303msec
> WRITE: io=131072MB, aggrb=1262MB/s, minb=1292MB/s, maxb=1292MB/s, mint=103882msec, maxt=103882msec

Even better than I anticipated.  Nice, very nice.  2.3x the write
throughput.

Your last AIO single threaded streaming run:
READ:   2,200 MB/s
WRITE:    560 MB/s

Multi-threaded run with stripe cache optimization and compressible data:
READ:   2,500 MB/s
WRITE: *1,300 MB/s*

> Is this what I should be expecting now?

No, because this FIO test, as with the streaming test, isn't an accurate
model of your real daily IO workload, which entails much smaller, mixed
read/write random IOs.  But it does give a more accurate picture of the
peak aggregate write bandwidth of your array.

Once you have determined the optimal stripe_cache_size, you need to run
this FIO test again, multiple times, first with the LVM snapshot
enabled, and then with DRBD enabled.

The DRBD load on the array on san1 should be only reads at a maximum
rate of ~120MB/s as you have a single GbE link to the secondary.  This
is only 1/20th of the peak random read throughput of the array.  Your
prior sequential FIO runs showed a huge degradation in write performance
when DRBD was running.  This makes no sense, and should not be the case.
 You need to determine why DRBD on san1 is hammering write performance.

> To me, it looks like it is close enough, but if you think I should be able to get even faster, then I will certainly investigate further.

There may be some juice left on the table.  Experiment with
stripe_cache_size until you hit the sweet spot.  I'd use only power of 2
values.  If 32768 gives a decent gain, then try 65536, then 131072.  If
32768 doesn't gain, or decreases throughput, try 16384.  If 16384
doesn't yield decent gains or goes backward, stick with 8192.  Again,
you must manually stick the value as it doesn't survive reboots.
Easiest route is cron.

>>> 4) Try to connect the SSD's direct to the HBA, bypassing the hotswap
>>> device in case this is limiting to SATA II or similar.

<snippety sip snip>  Put palm to forehead.

FIO 2.5GB/s read speed.  2.5GBps / 5 = 500MB/s per drive, ergo your link
speed must be 6Gbps on each drive.  If it were 3Gbps you'd be limited to
300MB/s per drive, 1.5GB/s total.

> I'll have to find some software to run benchmarks within the windows VM's

FIO runs on Windows:  http://www.bluestop.org/fio/

> READ: io=32768MB, aggrb=237288KB/s, minb=237288KB/s, maxb=237288KB/s, mint=141408msec, maxt=141408msec
> WRITE: io=32768MB, aggrb=203307KB/s, minb=203307KB/s, maxb=203307KB/s, mint=165043msec, maxt=165043msec
> 
> So, 230MB/s read and 200MB/s write, using 2 x 1Gbps ethernet seems pretty good. 

So little left it's not worth optimization time.

> The TS VM's have dedicated 4 cores, and the physical machines have dedicated 2 cores. I don't think the TS are CPU limited, this was for me to watch this on the DC to ensure it's single CPU was not limiting performance

I understand the DC issue.  I was addressing a different issue here.  If
the TS VMs have 4 cores there's nothing more you can do.

> I only run a single windows VM on each xen box (unless the xen boxes started failing, then I'd run multiple VM's on one xen box until it was repaired). 

From your previous msgs early in the thread I was under the impression
you were running two TS VMs per Xen box.  I must have misread something.

> As such, I have 2 CPU cores dedicated to the dom0 (xen, physical box) and 4 cores dedicated to the windows VM (domU). I can drop the physical box back to a single core and add an additional core to windows, but I don't think that will make much of a difference. I don't tend to see high CPU load on the TS boxes... 

You may see less than peak averages in Munin due to the capture interval
of munin-node, but still have momentary peaks that bog users down.  I'm
not saying this is the case, but it's probably worth investigating
further, via methods other than studying Munin graphs.

> Mostly I see high memory utilisation more than CPU, and that is one of the reasons to upgrade them to Win2008 64bit so I can allocate more RAM. I'm hoping that even with the virtualisation overhead, the modern CPU's are faster than the previous physical machines which were about 5 years old.

MS stupidity:		x86 	x64

W2k3 Server Standard	4GB	 32GB
XP			4GB	128GB

> So, overall, I haven't achieved anywhere near as much as I had hoped... 

You doubled write throughput to 1.3GB/s, at least WRT FIO.  That's one
fairly significant achievement.

> I've changed the stripe_cache_size, and disabled HT on san1.

Remember to test other sizes and make the optimum value permanent.

> Seems to be faster than before, so will see how it goes today/this week.

The only optimization since your last FIO test was increasing
stripe_cache_size (the rest of the FIO throughput increase was simply
due to changing the workload profile and using non random data buffers).
 The buffer difference:

stripe_cache_size	buffer space		full stripes buffered

  256 (default)		  5 MB			  16
 8192			160 MB			 512

To find out how much of the 732MB/s write throughput increase is due to
buffering 512 stripes instead of 16, simply change it back to 256,
re-run my FIO job file, and subtract the write result from 1292MB/s.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-04  9:39                                     ` Adam Goryachev
@ 2013-03-04 12:41                                       ` Phil Turmel
  2013-03-04 12:42                                       ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Phil Turmel @ 2013-03-04 12:41 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: stan, Linux RAID

On 03/04/2013 04:39 AM, Adam Goryachev wrote:

> Probably a silly question, but how do you convert from the
> information: Fast-Root: 0 314572800 linear 9:3 3072 vg0-hostname: 0
> 204808192 linear 147:2 512
> 
> My example was 512 sectors, assuming a sector size of 512 bytes,
> that provides 256kB (as you advised above).
> 
> Your example was 3072 sectors, again assuming a sector size of 512 
> bytes, that becomes 1.5MB (as you stated above).
> 
> So how come my system has an alignment of 256k (1 x offset) while
> yours has 512k (1.5M/3) ?

Alignment is the largest power-of-two boundary that the beginning of
the data area falls on.  In practice, that means find the
lowest "1" bit in the binary representation of the starting offset.  I
personally switch to hex, then mentally pick the lowest "1" bit in the
first non-zero digit from the right.
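
Worked through for the two offsets above: 3072 = 0xc00, lowest set bit
0x400 = 1024 sectors = 512k; 512 = 0x200, lowest set bit 0x200 = 512
sectors = 256k.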

> I'm assuming that in any case, as you suggested, the main thing is
> that the answer I got (256kB offset) was equal to the MD chunk size
> (64kB) multiplied by the number of data drives (4), or 256kB.

Yes.

>>> Also, pvdisplay tells me the PE Size is 4M, so I'm assuming that
>>> regardless of how the LV's are arranged, they will always be 512k
>>> aligned?
>> 
>> 256k, but yeah.
> 
> So LVM will not allocate any LV a block of space smaller than 4M,
> and I'm assuming will always be on a 4M boundary from the beginning
> of the device. Since 4MB is a mulitple of 256kB, then alignment is
> OK?

It will not be on a 4M boundary.  The first PE is at a 256k offset, so
any multiple of 4M added to that will also be 256k aligned.  For you,
that's fine.

> If the MD stripe size was larger, eg, if I added 2 more drives it
> would become 64kB chunk x 6 data drives = 384kB. This would mean my
> LVM is no longer properly aligned. The first block of 4MB would start
> at 256kB which is smaller than the stripe size, and each 4MB block
> would most likely not line up since 4MB is not divisible by 384kB?

Then it gets complicated.  When the # of data drives in parity raid
isn't a power of two, you generally cannot make higher layers
consistently align with the stripe boundaries.  The best you can do is
align to the greatest common power of two of the stripe size.  For your
example, that would be 128k.

> So, if I ever choose to expand the array to include a larger number
> of devices (as opposed to replacing all members with larger drives),
> what would I need to do to fix all this up?
>
> Re-partition to start the partition at a higher starting sector
> (such that 4M / start sector * 512 produces an integer)?

pvcreate can be told what alignment to use.  It will round up to its
requirements, though.  vgcreate can be told what physical extent size to
use.  So you have a great deal of control over these behaviors.  It
can't deal with an odd stripe size, though.
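
Roughly (check your pvcreate/vgcreate man pages; --dataalignment needs
a reasonably recent LVM2):

  ~$ pvcreate --dataalignment 384k /dev/<pv-device>
  ~$ vgcreate -s 4M vg0 /dev/<pv-device>

That pins the start of the data area, but as above it can't make every
4M extent land on a 384k stripe boundary.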

> That resolves the first LVM block, but to ensure all other blocks
> are properly aligned, is the best answer to upgrade to 8 x data
> drives (512kB stripe size)? Or is there some other magic solution I'm
> missing here?

You want data alignment to be greater than or equal to stripe alignment.
 Going to 8 drives would break alignment for your existing PV.

>>> So, is that enough to be sure that this is not an issue?
>> It looks to me like you are good on alignment.
> 
> Thanks.
> 
> On 04/03/13 16:25, Stan Hoeppner wrote:
>> If you have no gaps between this one and your other LVs, and each
>> of them is evenly divisible by 512 sectors, then they should all
>> be aligned.
> 
> Given the 4MB size of the LVM blocks, does that automatically make
> this true? I thought it did, but given your above comment, I'm
> unsure.

The overall alignment is the smallest power-of-two alignment among all
of the offsets and sizing multiples.  The 4M PE size is larger than any
of the other alignment
factors, so it drops out of the analysis.

HTH,

Phil

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-04  9:39                                     ` Adam Goryachev
  2013-03-04 12:41                                       ` Phil Turmel
@ 2013-03-04 12:42                                       ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-04 12:42 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Phil Turmel, Linux RAID

On 3/4/2013 3:39 AM, Adam Goryachev wrote:

> On 04/03/13 16:25, Stan Hoeppner wrote:
>> If you have no gaps between this one and your other LVs, and each of 
>> them is evenly divisible by 512 sectors, then they should all be
>> aligned.
> 
> Given the 4MB size of the LVM blocks, does that automatically make this
> true? I thought it did, but given your above comment, I'm unsure.

512 sectors is the size of your md/RAID stripe.

512 bytes/sector * 512 sectors = 262,144 bytes = 256 KB

4 MB = 4 * 1,048,576 = 4,194,304 bytes

4,194,304 bytes / 262,144 bytes = 16

So there are exactly 16 md/RAID stripes sitting 'under' each 4MB LVM block.


WRT to your question of changing the stripe width, due to drive
expansion, to a value not evenly divisible into 4MB, I'd say you'd need
to change your chunk size during the expansion reshape so that your new
stripe width is evenly divisible into 4MB.  You'll need to plan this, as
with some drive counts it may not be possible.  Is it possible to change
the LVM block size to fit the new stripe width?

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results - 5x SSD RAID5
  2013-03-04 12:20                                 ` Stan Hoeppner
@ 2013-03-04 16:26                                   ` Adam Goryachev
  2013-03-05  9:30                                     ` RAID performance - 5x SSD RAID5 - effects of stripe cache sizing Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-03-04 16:26 UTC (permalink / raw)
  To: stan; +Cc: Linux RAID

On 04/03/13 23:20, Stan Hoeppner wrote:
> On 3/3/2013 11:32 AM, Adam Goryachev wrote:
>> Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>>> 1) Make sure stripe_cache_size is as least 8192.  If not:
>>>> ~$ echo 8192 > /sys/block/md0/md/stripe_cache_size
>>>> Currently using default 256.
>>
>> Done now
> 
> I see below that this paid some dividend.  You could try increasing it
> further and may get even better write throughput for this FIO test, but
> keep in mind large stripe_cache_size values eat serious amounts of RAM:
> 
> Formula:  stripe_cache_size * 4096 bytes * drive_count = RAM usage.  For
> your 5 drive array:
> 
>  8192 eats 160MB
> 16384 eats 320MB
> 32768 eats 640MB
> 
> Considering this is an iSCSI block IO server, dedicating 640MB of RAM to
> md stripe cache isn't a bad idea at all if it seriously increases write
> throughput (and without decreasing read throughput).  You don't need RAM
> for buffer cache since you're not doing local file operations.  I'd even
> go up to 131072 and eat 2.5GB of RAM if the performance is substantially
> better than lower values.
> 
> Whatever value you choose, make it permanent by adding this entry to
> root's crontab:
> 
> @reboot		/bin/echo 32768 > /sys/block/md0/md/stripe_cache_size

Already added to /etc/rc.local along with the config to set the deadline
scheduler for each of the RAID drives.
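
For reference, the rc.local entries look roughly like this (retyped
from memory, so the device letters may not be exact):

  echo 8192 > /sys/block/md1/md/stripe_cache_size
  for d in sdb sdc sdd sde sdf; do
      echo deadline > /sys/block/$d/queue/scheduler
  done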

I will certainly test with higher numbers, I've got 8GB of RAM, and
there is really not much else that needs the RAM for anything near as
important as this. I'd honestly be happy to dedicate at least 4 or 5GB
of RAM if it was going to improve performance... I'll try values up to
262144 which should be 5120MB of RAM, leaving well over 2GB for the OS
and minor monitoring/etc...
Current memory usage:
             total       used       free     shared    buffers     cached
Mem:       7904320    1540692    6363628          0     130284     856148
-/+ buffers/cache:     554260    7350060
Swap:      3939324          0    3939324

Will advise results of testing ...

>>>> top -b -n 60 -d 0.25|grep Cpu|sort -n > /some.dir/some.file
>>
>> Done now
>>
>> There seems to be only one row from the top output which is interesting:
>> Cpu0 : 3.6%us, 71.4%sy, 0.0%ni, 0.0%id, 10.7%wa, 0.0%hi, 14.3%si, 0.0%,st
>>
>> Every other line had a high value for %id, or %wa, with lower values for %sy. This was during the second 'stage' of the fio run, earlier in the fio run there was no entry even close to showing the CPU as busy.
> 
> I expended the the time/effort walking you through all of this because I
> want to analyze the complete output myself.  Would you please pastebin
> it or send me the file?  Thanks.

I'll send the file off-list due to size... I was working on-site with a
windows box, I wasn't logged into my email from there...

>> READ: io=131072MB, aggrb=2506MB/s, minb=2566MB/s, maxb=2566MB/s, mint=52303msec, maxt=52303msec
>> WRITE: io=131072MB, aggrb=1262MB/s, minb=1292MB/s, maxb=1292MB/s, mint=103882msec, maxt=103882msec
> 
> Even better than I anticipated.  Nice, very nice.  2.3x the write
> throughput.
> 
> Your last AIO single threaded streaming run:
> READ:   2,200 MB/s
> WRITE:    560 MB/s
> 
> Multi-threaded run with stripe cache optimization and compressible data:
> READ:   2,500 MB/s
> WRITE: *1,300 MB/s*
> 
>> Is this what I should be expecting now?
> 
> No, because this FIO test, as with the streaming test, isn't an accurate
> model of your real daily IO workload, which entails much smaller, mixed
> read/write random IOs.  But it does give a more accurate picture of the
> peak aggregate write bandwidth of your array.
> 
> Once you have determined the optimal stripe_cache_size, you need to run
> this FIO test again, multiple times, first with the LVM snapshot
> enabled, and then with DRBD enabled.
> 
> The DRBD load on the array on san1 should be only reads at a maximum
> rate of ~120MB/s as you have a single GbE link to the secondary.  This
> is only 1/20th of the peak random read throughput of the array.  Your
> prior sequential FIO runs showed a huge degradation in write performance
> when DRBD was running.  This makes no sense, and should not be the case.
>  You need to determine why DRBD on san1 is hammering write performance.

I've re-run the fio test from above just now, except all the VM's are
online, should be mostly idle, and also the secondary DRBD is connected:
stripe_cache_size = 2048
>    READ: io=131072MB, aggrb=2502MB/s, minb=2562MB/s, maxb=2562MB/s, mint=52397msec, maxt=52397msec
>   WRITE: io=131072MB, aggrb=994MB/s, minb=1018MB/s, maxb=1018MB/s, mint=131803msec, maxt=131803msec

stripe_cache_size = 4096
>    READ: io=131072MB, aggrb=2504MB/s, minb=2564MB/s, maxb=2564MB/s, mint=52348msec, maxt=52348msec
>   WRITE: io=131072MB, aggrb=1590MB/s, minb=1628MB/s, maxb=1628MB/s, mint=82455msec, maxt=82455msec

stripe_cache_size = 8192
>    READ: io=131072MB, aggrb=2487MB/s, minb=2547MB/s, maxb=2547MB/s, mint=52697msec, maxt=52697msec
>   WRITE: io=131072MB, aggrb=1521MB/s, minb=1557MB/s, maxb=1557MB/s, mint=86188msec, maxt=86188msec

stripe_cache_size = 16384
>    READ: io=131072MB, aggrb=2494MB/s, minb=2554MB/s, maxb=2554MB/s, mint=52556msec, maxt=52556msec
>   WRITE: io=131072MB, aggrb=1368MB/s, minb=1401MB/s, maxb=1401MB/s, mint=95779msec, maxt=95779msec

stripe_cache_size = 32768
>    READ: io=131072MB, aggrb=2489MB/s, minb=2549MB/s, maxb=2549MB/s, mint=52661msec, maxt=52661msec
>   WRITE: io=131072MB, aggrb=1138MB/s, minb=1165MB/s, maxb=1165MB/s, mint=115209msec, maxt=115209msec

(let me know if you want the full fio output....)
This seems to show that DRBD did not slow things down at all... I don't
remember exactly when I did the previous fio tests with drbd connected,
but perhaps I've made changes to the drbd config since then and/or
upgraded from the debian stable drbd to 8.3.15

Let's re-run the above tests with DRBD stopped:
stripe_cache_size = 256
>    READ: io=131072MB, aggrb=2496MB/s, minb=2556MB/s, maxb=2556MB/s, mint=52508msec, maxt=52508msec
>   WRITE: io=131072MB, aggrb=928148KB/s, minb=950424KB/s, maxb=950424KB/s, mint=144608msec, maxt=144608msec

stripe_cache_size = 512
>    READ: io=131072MB, aggrb=2497MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52484msec, maxt=52484msec
>   WRITE: io=131072MB, aggrb=978170KB/s, minb=978MB/s, maxb=978MB/s, mint=137213msec, maxt=137213msec

stripe_cache_size = 2048
>    READ: io=131072MB, aggrb=2502MB/s, minb=2562MB/s, maxb=2562MB/s, mint=52382msec, maxt=52382msec
>   WRITE: io=131072MB, aggrb=996MB/s, minb=1020MB/s, maxb=1020MB/s, mint=131631msec, maxt=131631msec

stripe_cache_size = 4096
>    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
>   WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec

stripe_cache_size = 8192
>    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
>   WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec

stripe_cache_size = 16384
>    READ: io=131072MB, aggrb=2482MB/s, minb=2542MB/s, maxb=2542MB/s, mint=52807msec, maxt=52807msec
>   WRITE: io=131072MB, aggrb=1377MB/s, minb=1410MB/s, maxb=1410MB/s, mint=95191msec, maxt=95191msec

stripe_cache_size = 32768
>    READ: io=131072MB, aggrb=2498MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52481msec, maxt=52481msec
>   WRITE: io=131072MB, aggrb=1139MB/s, minb=1166MB/s, maxb=1166MB/s, mint=115102msec, maxt=115102msec

I was going to try 65536 as well, but that value doesn't work at all:
echo 65536 > /sys/block/md1/md/stripe_cache_size
-bash: echo: write error: Invalid argument

So, it looks like the ideal value is actually smaller (4096) although
there is not much difference between 8192 and 4096. It seems strange
that a larger cache size will actually reduce performance... I'll change
to 4096 for the time being, unless you think "real world" performance
might be better with 8192?

Here are the results of re-running fio using the previous config (with
drbd connected with the stripe_cache_size = 8192):
>    READ: io=4096MB, aggrb=2244MB/s, minb=2298MB/s, maxb=2298MB/s, mint=1825msec, maxt=1825msec
>   WRITE: io=4096MB, aggrb=494903KB/s, minb=506780KB/s, maxb=506780KB/s, mint=8475msec, maxt=8475msec

Perhaps the old fio test just isn't as well suited to the way drbd
handles things. Though the real question is what sort of IO the real
users generate, because whether that looks more like the old fio test or
the new one makes a big difference.

So, it looks like it is the stripe_cache_size that is affecting
performance, and that DRBD makes no difference whether it is connected
or not. Possibly removing it completely would increase performance
somewhat, but since I actually do need it, and that is somewhat
destructive, I won't try that :)

>> To me, it looks like it is close enough, but if you think I should be able to get even faster, then I will certainly investigate further.
> 
> There may be some juice left on the table.  Experiment with
> stripe_cache_size until you hit the sweet spot.  I'd use only power of 2
> values.  If 32768 gives a decent gain, then try 65536, then 131072.  If
> 32768 doesn't gain, or decreases throughput, try 16384.  If 16384
> doesn't yield decent gains or goes backward, stick with 8192.  Again,
> you must manually stick the value as it doesn't survive reboots.
> Easiest route is cron.

Will stick with 4096 for the moment based on the above results.

>>>> 4) Try to connect the SSD's direct to the HBA, bypassing the hotswap
>>>> device in case this is limiting to SATA II or similar.
> 
> <snippety sip snip>  Put palm to forehead.
> 
> FIO 2.5GB/s read speed.  2.5GBps / 5 = 500MB/s per drive, ergo your link
> speed must be 6Gbps on each drive.  If it were 3Gbps you'd be limited to
> 300MB/s per drive, 1.5GB/s total.

Of course, thank you for the blindingly obvious :)

>> I'll have to find some software to run benchmarks within the windows VM's
> 
> FIO runs on Windows:  http://www.bluestop.org/fio/

Will check into that, it will be the ultimate end-to-end test.... Also,
I can test the difference between the windows 2003 and windows 2000 to
see if there is any difference there....

>> Mostly I see high memory utilisation more than CPU, and that is one of the reasons to upgrade them to Win2008 64bit so I can allocate more RAM. I'm hoping that even with the virtualisation overhead, the modern CPU's are faster than the previous physical machines which were about 5 years old.
> 
> MS stupidity:		x86 	x64
> 
> W2k3 Server Standard	4GB	 32GB
> XP			4GB	128GB

Hmmm, good point, I realised I could try and upgrade to the x64 windows
2003, but I think I'd prefer to just move up to 2008 x64 (or 2012)...
For now, I'll just keep using my hacky 4GB RAM drive for the pagefile...

>> So, overall, I haven't achieved anywhere near as much as I had hoped... 
> You doubled write throughput to 1.3GB/s, at least WRT FIO.  That's one
> fairly significant achievement.

I meant I hadn't crossed off as many items from my list of things to
do... Not that I hadn't improved performance significantly :)

>> Seems to be faster than before, so will see how it goes today/this week.
> 
> The only optimization since your last FIO test was increasing
> stripe_cache_size (the rest of the FIO throughput increase was simply
> due to changing the workload profile and using non random data buffers).
>  The buffer difference:
> stripe_cache_size	buffer space		full stripes buffered
>   256 (default)		  5 MB			  16
>  8192			160 MB			 512
> 
> To find out how much of the 732MB/s write throughput increase is due to
> buffering 512 stripes instead of 16, simply change it back to 256,
> re-run my FIO job file, and subtract the write result from 1292MB/s.

So, running your FIO job file with the original 256 gives a write speed
of 950MB/s and the previous FIO file gives 509MB/s. So it would seem the
increase in stripe_cache_size from 256 to 4096 gives an increase in your
FIO job from 950MB/s to 1634MB/s, which is a significant speed boost. I
must wonder why we have a default of 256 when this can make such a
significant performance improvement? A value of 4096 with a 5 drive raid
array is only 80MB of cache, I suspect very few users with a 5 drive
RAID array would be concerned about losing 80MB of RAM, and a 2 drive
RAID array would only use 32MB ...

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-04 16:26                                   ` Adam Goryachev
@ 2013-03-05  9:30                                     ` Stan Hoeppner
  2013-03-05 15:53                                       ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-05  9:30 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Linux RAID

On 3/4/2013 10:26 AM, Adam Goryachev wrote:

>> Whatever value you choose, make it permanent by adding this entry to
>> root's crontab:
>>
>> @reboot		/bin/echo 32768 > /sys/block/md0/md/stripe_cache_size
> 
> Already added to /etc/rc.local along with the config to set the deadline
> scheduler for each of the RAID drives.

You should be using noop for SSD, not deadline.  noop may improve your
FIO throughput, and real workload, even further.

Also, did you verify with a reboot that stripe_cache_size is actually
being set correctly at startup?  If it's not working as assumed you'll
be losing several hundred MB/s of write throughput at the next reboot.
Something this critical should always be tested and verified.
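
A quick check after the next reboot would be something like:

  ~$ cat /sys/block/md0/md/stripe_cache_size
  ~$ cat /sys/block/sdb/queue/scheduler

The scheduler file shows the active elevator in square brackets, so you
want to see [noop] (or [deadline] if you stay with that) rather than
[cfq].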

> stripe_cache_size = 4096
>>    READ: io=131072MB, aggrb=2504MB/s, minb=2564MB/s, maxb=2564MB/s, mint=52348msec, maxt=52348msec
>>   WRITE: io=131072MB, aggrb=1590MB/s, minb=1628MB/s, maxb=1628MB/s, mint=82455msec, maxt=82455msec

Wow, we're up to 1.6 GB/s data throughput, 2 GB/s total md device
throughput.  That's 407MB/s per SSD.  This is much more inline with what
one would expect from a RAID5 using 5 large, fast SandForce SSDs.  This
is 80% of the single drive streaming write throughput of this SSD model,
as tested by Anandtech, Tom's, and others.

I'm a bit surprised we're achieving 2 GB/s parity write throughput with
the single threaded RAID5 driver on one core.  Those 3.3GHz Ivy Bridge
cores are stouter than I thought.  Disabling HT probably helped a bit
here.  I'm anxious to see the top output file for this run (if you made
one--you should for each and every FIO run).  Surely we're close to
peaking the core here.

> stripe_cache_size = 8192
>>    READ: io=131072MB, aggrb=2487MB/s, minb=2547MB/s, maxb=2547MB/s, mint=52697msec, maxt=52697msec
>>   WRITE: io=131072MB, aggrb=1521MB/s, minb=1557MB/s, maxb=1557MB/s, mint=86188msec, maxt=86188msec

Interesting.  4096/8192 are both higher by ~300MB/s compared to the
previous 1292MB/s you posted for 8192.  Some other workload must have
been active during the previous run, or something else has changed.

> stripe_cache_size = 16384
>>    READ: io=131072MB, aggrb=2494MB/s, minb=2554MB/s, maxb=2554MB/s, mint=52556msec, maxt=52556msec
>>   WRITE: io=131072MB, aggrb=1368MB/s, minb=1401MB/s, maxb=1401MB/s, mint=95779msec, maxt=95779msec
> 
> stripe_cache_size = 32768
>>    READ: io=131072MB, aggrb=2489MB/s, minb=2549MB/s, maxb=2549MB/s, mint=52661msec, maxt=52661msec
>>   WRITE: io=131072MB, aggrb=1138MB/s, minb=1165MB/s, maxb=1165MB/s, mint=115209msec, maxt=115209msec

This is why you test, and test, and test when tuning for performance.
4096 seems to be your sweet spot.

> (let me know if you want the full fio output....)

No, the summary is fine.  What's more valuable is to have the top
output file for each run so I can see what's going on.  At 2 GB/s of
throughput your interrupt rate should be pretty high, and I'd like to
see the IRQ spread across the cores, as well as the RAID5 thread load,
among other things.  I haven't yet looked at the file you sent, but I'm
guessing it doesn't include this 1.6GB/s run.  I'm really interested in
seeing that one, and the ones for 16384 and 32768.  WRT the latter two,
I'm curious whether the much larger tables are causing excessive CPU
burn, which may in turn be what lowers throughput.

> This seems to show that DRBD did not slow things down at all... I don't

I noticed.

> remember exactly when I did the previous fio tests with drbd connected,
> but perhaps I've made changes to the drbd config since then and/or
> upgraded from the debian stable drbd to 8.3.15

Maybe it wasn't actively syncing when you made these FIO runs.

> Let's re-run the above tests with DRBD stopped:
...
> stripe_cache_size = 4096
>>    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
>>   WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec
> 
> stripe_cache_size = 8192
>>    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
>>   WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec
...

Numbers are identical.  Either DRBD wasn't actually copying anything
during the previous FIO run, its nice level changed, its
configuration/behavior changed with the new version, or something.
Whatever the reason, it appears to be putting no load on the array.

> So, it looks like the ideal value is actually smaller (4096) although
> there is not much difference between 8192 and 4096. It seems strange
> that a larger cache size will actually reduce performance... I'll change

It's not strange at all, but expected.  As a table gets larger it takes
more CPU cycles to manage it and more memory bandwidth; your cache miss
rate increases, etc.  At a certain point this overhead becomes
detrimental instead of beneficial.  In your case the size of the cache
table outweighs the overhead and yields increased performance up to 80MB
table size.  At 160MB and above the size of the table creates more
overhead than performance benefit.

This is what system testing/tuning is all about.

> to 4096 for the time being, unless you think "real world" performance
> might be better with 8192?

These FIO runs are hitting your IO subsystem much harder than your real
workloads ever will.  Stick with 4096.

> Here are the results of re-running fio using the previous config (with
> drbd connected with the stripe_cache_size = 8192):
>>    READ: io=4096MB, aggrb=2244MB/s, minb=2298MB/s, maxb=2298MB/s, mint=1825msec, maxt=1825msec
>>   WRITE: io=4096MB, aggrb=494903KB/s, minb=506780KB/s, maxb=506780KB/s, mint=8475msec, maxt=8475msec
> 
> Perhaps the old fio test just isn't as well suited to the way drbd
> handles things. Though the issue would be what sort of data the real
> users are doing, because if that matches the old fio test or the new fio
> test, it makes a big difference.

The significantly lower throughput of the "old" FIO job has *nothing* to
do with DRBD.  It has everything to do with the parameters of the job
file.  I thought I explained the differences previously.  If not, here
you go:

1.  My FIO job has 16 workers submitting IO in parallel
    The "old" job has a single worker submitting serially
    -- both are using AIO

2.  My FIO job uses zeroed buffers, allowing the SSDs to compress data
    The old job uses randomized data, thus SSD compression is lower

3.  My FIO job does 256KB IOs, each one filling a RAID stripe
    The old job does 64KB IOs, each one filling one chunk

4.  My FIO job does random IOs, spreading the writes over the volume
    The old job does serial IOs
    -- the SandForce controllers have 8 channels and can write to all 8
       in parallel.  Writing randomly creates more opportunity for the
       controller to write multiple channels concurrently

My FIO job simulates a large multiuser heavy concurrent IO workload.  It
creates 16 threads, 4 running on each core.  In parallel, they submit a
massive amount of random, stripe width writes, containing uniform data,
asynchronously, to the block device, here the md/RAID5 device.  Doing
this ensures the IO pipeline is completely full all the time, with zero
delays between submissions.
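
To make that concrete, a job file shaped roughly like the following would
generate that kind of load.  This is only a sketch, not my actual job file;
the size and iodepth values are illustrative, and pointing filename at the
raw md device overwrites whatever is on it:

  [global]
  ioengine=libaio
  direct=1
  bs=256k            # one full stripe per IO (4 data chunks x 64KB)
  iodepth=16
  numjobs=16         # 16 workers submitting in parallel
  zero_buffers=1     # uniform data, so the SandForce controllers can compress
  group_reporting
  filename=/dev/md0  # raw md device -- destroys existing data
  size=8g            # 16 jobs x 8g = 128g of IO per phase

  [rand-read]
  rw=randread
  stonewall

  [rand-write]
  rw=randwrite
  stonewall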

The "old" FIO job creates a single thread which submits chunk size
overlapping writes asynchronously via the io_submit() system call
(libaio).  Contrary to apparently popular belief, this does not allow
one to send a continuous stream of overlapping writes from a single
thread with no time slice gaps between the system calls.

My FIO job threads use io_submit() as well, but there are 16 threads
submitting in parallel, leaving no time gaps between IO submissions,
with massive truly overlapping IOs.  This parallel job could be run with
any number of FIO engines with the same results.  I stuck with AIO for
direct comparison as we're doing here.

Because it is sending so many more IOs per unit time than the single
threaded job, the larger md stripe cache is of great benefit.  The
single threaded job isn't submitting sufficient IOs per unit of time for the
larger stripe cache to make a difference.

The takeaway here is not that my FIO job makes the SSD RAID faster.  It
simply pushes a sufficient amount of IO to demonstrate the intrinsic
high throughput the array is capable of.  For those fond of car
analogies:  the old FIO test is barely pushing on the throttle;  my FIO
test is hammering the pedal to the floor.  Same car, same speed
potential, just different amounts of load applied to the pedal.

> So, it looks like it is the stripe_cache_size that is affecting
> performance, and that DRBD makes no difference whether it is connected
> or not. Possibly removing it completely would increase performance
> somewhat, but since I actually do need it, and that is somewhat
> destructive, I won't try that :)

I'd do more investigating of this.  DRBD can't put zero load on the
array if it's doing work.  Given it's a read only workload, it's
possible the increased stripe cache is allowing full throttle writes
while doing 100MB/s of reads, without writes being impacted.  You'll
need to look deeper into the md statistics and/or monitor iostat, etc,
during runs with DRBD active and actually moving data.

> Will stick with 4096 for the moment based on the above results.

That's my recommendation.

>> FIO runs on Windows:  http://www.bluestop.org/fio/
> 
> Will check into that, it will be the ultimate end-to-end test.... Also,

Yes, it will.  As long as you're running at least 16-32 threads per TS
client to overcome TCP/iSCSI over GbE latency, and the lack of AIO on
Windows.  And you can't simply reuse the same job file.  The docs tell
you which engine, and other settings, to use for Windows.
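
As a rough sketch of the sort of Windows job I mean (values are illustrative,
not tuned; run it from a directory on the volume under test so fio creates
its data files there, and check the fio docs for the exact option names in
your build):

  [global]
  ioengine=windowsaio   # the native async engine on Windows
  thread                # run workers as threads rather than forked processes
  direct=1
  bs=64k
  iodepth=8
  size=1g
  numjobs=32            # 16-32 workers to keep the GbE/iSCSI pipe full
  group_reporting

  [ts-user-mix]
  rw=randrw
  rwmixread=70          # 70/30 read/write mix as a stand-in for user load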

> Hmmm, good point, I realised I could try and upgrade to the x64 windows
> 2003, but I think I'd prefer to just move up to 2008 x64 (or 2012)...
> For now, I'll just keep using my hacky 4GB RAM drive for the pagefile...

Or violate BCP and run two TS instances per Xen, or even four, with the
appropriate number of users per each.  KSM will consolidate all the
Windows and user application read only files (DLLs, exes, etc), yielding
much more free real memory than with a single Windows TS instance.
AFAIK Windows has no memory merging so you can't over commit memory
other than with the page file, which is horribly less efficient than KSM.

> I meant I hadn't crossed off as many items from my list of things to
> do... Not that I hadn't improved performance significantly :)

I know, was just poking you in the ribs. ;)

>> To find out how much of the 732MB/s write throughput increase is due to
>> buffering 512 stripes instead of 16, simply change it back to 256,
>> re-run my FIO job file, and subtract the write result from 1292MB/s.
> 
> So, running your FIO job file with the original 256 gives a write speed
> of 950MB/s and the previous FIO file gives 509MB/s. So it would seem the
> increase in stripe_cache_size from 256 to 4096 gives an increase in your
> FIO job from 950MB/s to 1634MB/s, which is a significant speed boost. I

72 percent increase with this synthetic workload, by simply increasing
the stripe cache.  Not bad eh?  This job doesn't present an accurate
picture of real world performance though, as most synthetic tests don't.

Get DRBD a hump'n and your LVM snapshot(s) in place, all the normal
server side load, then fire up the 32 thread FIO test on each TS VM to
simulate users (I could probably knock out this job file if you like).
Then monitor the array throughput with iostat or similar.  This would be
about as close to peak real world load as you can get.
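
For the monitoring, something along these lines on the SAN box while the
load is running would do (device names are examples):

  # extended per-device stats in MB/s, 5 second intervals
  iostat -xm 5 /dev/md0 /dev/sd[b-f]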

> must wonder why we have a default of 256 when this can make such a
> significant performance improvement?  A value of 4096 with a 5 drive raid
> array is only 80MB of cache, I suspect very few users with a 5 drive
> RAID array would be concerned about losing 80MB of RAM, and a 2 drive
> RAID array would only use 32MB ...

The stripe cache has nothing to do with device count, but hardware
throughput.  Did you happen to notice what occurred when you increased
cache size past your 4096 sweet spot to 32768?  Throughput dropped by
~500MB/s, almost 1/3rd.  Likewise, for the slow rust array whose sweet
spot is 512, making the default 4096 will decrease his throughput, and
eat 80MB RAM for nothing.  Defaults are chosen to work best with the
lowest common denominator hardware, not the Ferrari.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-05  9:30                                     ` RAID performance - 5x SSD RAID5 - effects of stripe cache sizing Stan Hoeppner
@ 2013-03-05 15:53                                       ` Adam Goryachev
  2013-03-07  7:36                                         ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-03-05 15:53 UTC (permalink / raw)
  To: stan; +Cc: Linux RAID

On 05/03/13 20:30, Stan Hoeppner wrote:
> On 3/4/2013 10:26 AM, Adam Goryachev wrote:
> 
>>> Whatever value you choose, make it permanent by adding this entry to
>>> root's crontab:
>>>
>>> @reboot		/bin/echo 32768 > /sys/block/md0/md/stripe_cache_size
>>
>> Already added to /etc/rc.local along with the config to set the deadline
>> scheduler for each of the RAID drives.
> 
> You should be using noop for SSD, not deadline.  noop may improve your
> FIO throughput, and real workload, even further.

OK, done now...

> Also, did you verify with a reboot that stripe_cache_size is actually
> being set correctly at startup?  If it's not working as assumed you'll
> be losing several hundred MB/s of write throughput at the next reboot.
> Something this critical should always be tested and verified.

Will do, thanks for the nudge...
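
For the record, the boot-time settings amount to roughly the following in
/etc/rc.local (device names are examples, adjust to suit), plus a cat of the
same files after a reboot to verify:

  echo 4096 > /sys/block/md0/md/stripe_cache_size
  for d in sdb sdc sdd sde sdf; do
      echo noop > /sys/block/$d/queue/scheduler
  done

  # verify after reboot
  cat /sys/block/md0/md/stripe_cache_size
  cat /sys/block/sd?/queue/scheduler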

>> stripe_cache_size = 4096
>>>    READ: io=131072MB, aggrb=2504MB/s, minb=2564MB/s, maxb=2564MB/s, mint=52348msec, maxt=52348msec
>>>   WRITE: io=131072MB, aggrb=1590MB/s, minb=1628MB/s, maxb=1628MB/s, mint=82455msec, maxt=82455msec
> 
> Wow, we're up to 1.6 GB/s data throughput, 2 GB/s total md device
> throughput.  That's 407MB/s per SSD.  This is much more inline with what
> one would expect from a RAID5 using 5 large, fast SandForce SSDs.  This
> is 80% of the single drive streaming write throughput of this SSD model,
> as tested by Anandtech, Tom's, and others.
> 
> I'm a bit surprised we're achieving 2 GB/s parity write throughput with
> the single threaded RAID5 driver on one core.  Those 3.3GHz Ivy Bridge
> cores are stouter than I thought.  Disabling HT probably helped a bit
> here.  I'm anxious to see the top output file for this run (if you made
> one--you should for each and every FIO run).  Surely we're close to
> peaking the core here.

I'll run some more tests on the box soon, and make sure to collect the
top outputs for each run. Will email the lot when done. (See below why
there will be some delay).

>> stripe_cache_size = 8192
>>>    READ: io=131072MB, aggrb=2487MB/s, minb=2547MB/s, maxb=2547MB/s, mint=52697msec, maxt=52697msec
>>>   WRITE: io=131072MB, aggrb=1521MB/s, minb=1557MB/s, maxb=1557MB/s, mint=86188msec, maxt=86188msec
> 
> Interesting.  4096/8192 are both higher by ~300MB/s compared to the
> previous 1292MB/s you posted for 8192.  Some other workload must have
> been active during the previous run, or something else has changed.

Every run I quoted in this email was actually done twice, and I used the
larger result (since we are trying to compare max
performance). However, I'm pretty sure the two runs were very similar in
results (less than 6MB/s difference).... I thought that maybe I should
have averaged the results, or run more tests, but really, I'm not that
seriously benchmarking to sell the stuff, I just need to know which one
worked best...

>> stripe_cache_size = 16384
>>>    READ: io=131072MB, aggrb=2494MB/s, minb=2554MB/s, maxb=2554MB/s, mint=52556msec, maxt=52556msec
>>>   WRITE: io=131072MB, aggrb=1368MB/s, minb=1401MB/s, maxb=1401MB/s, mint=95779msec, maxt=95779msec
>>
>> stripe_cache_size = 32768
>>>    READ: io=131072MB, aggrb=2489MB/s, minb=2549MB/s, maxb=2549MB/s, mint=52661msec, maxt=52661msec
>>>   WRITE: io=131072MB, aggrb=1138MB/s, minb=1165MB/s, maxb=1165MB/s, mint=115209msec, maxt=115209msec
> 
> This is why you test, and test, and test when tuning for performance.
> 4096 seems to be your sweet spot.

Yep, I ran those tests a lot more times (4096, 8192 and 16384) to try
and see if it was an anomaly, or some other strange effect...

>> (let me know if you want the full fio output....)
> 
> No, the summary is fine.  What's more more valuable to have the top
> output file for each run so I can see what's going on.  At 2 GB/s of
> throughput your interrupt rate should be pretty high, and I'd like to
> see the IRQ spread across the cores, as well as the RAID5 thread load,
> among other things.  I haven't yet looked at the file you sent, but I'm
> guessing it doesn't include this 1.6GB/s run.  I'm really interested in
> seeing that one, and the ones for 16384 and 32768.  WRT the latter two,
> I'm curious whether the much larger tables are causing excessive CPU
> burn, which may in turn be what lowers throughput.

OK, will prepare and send soon...

>> This seems to show that DRBD did not slow things down at all... I don't
> 
> I noticed.
> 
>> remember exactly when I did the previous fio tests with drbd connected,
>> but perhaps I've made changes to the drbd config since then and/or
>> upgraded from the debian stable drbd to 8.3.15
> Maybe it wasn't actively syncing when you made these FIO runs.

It was "in sync" prior to running the tests, and remained in sync during
the tests... However, with the newer 8.3.15 I've adjusted the config so
that if the secondary falls behind, it will drop out of sync, and catch
up when it can. There is no way the secondary can be writing at
1.6GB/sec over 1Gbps ethernet to a 4 x 2TB HDD RAID10....

>> Let's re-run the above tests with DRBD stopped:
> ...
>> stripe_cache_size = 4096
>>>    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
>>>   WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec
>>
>> stripe_cache_size = 8192
>>>    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
>>>   WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec
> ...
> 
> Numbers are identical.  Either DRBD wasn't actually copying anything
> during the previous FIO run, its nice level changed, its
> configuration/behavior changed with the new version, or something.
> Whatever the reason, it appears to be putting no load on the array.

Very surprising indeed... I'll still keep DRBD disconnected during the
day until I get a better handle on what is going on here.... I would
have expected *some* impact....

>> So, it looks like the ideal value is actually smaller (4096) although
>> there is not much difference between 8192 and 4096. It seems strange
>> that a larger cache size will actually reduce performance... I'll change
> 
> It's not strange at all, but expected.  As a table gets larger it takes
> more CPU cycles to manage it and more memory bandwidth; your cache miss
> rate increases, etc.  At a certain point this overhead becomes
> detrimental instead of beneficial.  In your case the benefit of the larger
> cache table outweighs the overhead and yields increased performance up to an 80MB
> table size.  At 160MB and above the size of the table creates more
> overhead than performance benefit.
> 
> This is what system testing/tuning is all about.

Of course, I suppose I assumed cache table management had zero cost
(CPU/memory bandwidth) but at these speeds, it would be quite a big
factor...

>> to 4096 for the time being, unless you think "real world" performance
>> might be better with 8192?
> 
> These FIO runs are hitting your IO subsystem much harder than your real
> workloads ever will.  Stick with 4096.

Very true... At 1.6GB/s that is equivalent (approx) to 8 x 1Gbps
ethernet, which is the maximum that all the machines can push at the
same time... and that is only write performance, read performance is
even higher.

>> Here are the results of re-running fio using the previous config (with
>> drbd connected with the stripe_cache_size = 8192):
>>>    READ: io=4096MB, aggrb=2244MB/s, minb=2298MB/s, maxb=2298MB/s, mint=1825msec, maxt=1825msec
>>>   WRITE: io=4096MB, aggrb=494903KB/s, minb=506780KB/s, maxb=506780KB/s, mint=8475msec, maxt=8475msec
>>
>> Perhaps the old fio test just isn't as well suited to the way drbd
>> handles things. Though the issue would be what sort of data the real
>> users are doing, because if that matches the old fio test or the new fio
>> test, it makes a big difference.
> 
> The significantly lower throughput of the "old" FIO job has *nothing* to
> do with DRBD.  It has everything to do with the parameters of the job
> file.  I thought I explained the differences previously.  If not, here
> you go:

Thanks :)

>> So, it looks like it is the stripe_cache_size that is affecting
>> performance, and that DRBD makes no difference whether it is connected
>> or not. Possibly removing it completely would increase performance
>> somewhat, but since I actually do need it, and that is somewhat
>> destructive, I won't try that :)
> 
> I'd do more investigating of this.  DRBD can't put zero load on the
> array if it's doing work.  Given it's a read only workload, it's
> possible the increased stripe cache is allowing full throttle writes
> while doing 100MB/s of reads, without writes being impacted.  You'll
> need to look deeper into the md statistics and/or monitor iostat, etc,
> during runs with DRBD active and actually moving data.

Yes, will check this out more carefully before I will re-enable DRBD
during the day....

>>> FIO runs on Windows:  http://www.bluestop.org/fio/
>>
>> Will check into that, it will be the ultimate end-to-end test.... Also,
> 
> Yes, it will.  As long as you're running at least 16-32 threads per TS
> client to overcome TCP/iSCSI over GbE latency, and the lack of AIO on
> Windows.  And you can't simply reuse the same job file.  The docs tell
> you which engine, and other settings, to use for Windows.

Well, I used mostly the same fio file... just changed the engine, and the
size of the test down to 1GB (so the test would finish more quickly).

>> Hmmm, good point, I realised I could try and upgrade to the x64 windows
>> 2003, but I think I'd prefer to just move up to 2008 x64 (or 2012)...
>> For now, I'll just keep using my hacky 4GB RAM drive for the pagefile...
> 
> Or violate BCP and run two TS instances per Xen, or even four, with the
> appropriate number of users per each.  KSM will consolidate all the
> Windows and user application read only files (DLLs, exes, etc), yielding
> much more free real memory than with a single Windows TS instance.
> AFAIK Windows has no memory merging so you can't over commit memory
> other than with the page file, which is horribly less efficient than KSM.

BCP = Best Computing Practise ?
KSM = Kernel SamePage Merging ? (Had to ask wikipedia for this one)...

I'm not sure xen supports this currently.... However, in addition to
either saving RAM / spending more CPU managing this, there is also the
licensing consideration of purchasing more windows server licenses.
Overall, the money is probably better spent on newer versions/upgrading...

>> I meant I hadn't crossed off as many items from my list of things to
>> do... Not that I hadn't improved performance significantly :)
> I know, was just poking you in the ribs. ;)

Ouch :)

>>> To find out how much of the 732MB/s write throughput increase is due to
>>> buffering 512 stripes instead of 16, simply change it back to 256,
>>> re-run my FIO job file, and subtract the write result from 1292MB/s.
>>
>> So, running your FIO job file with the original 256 gives a write speed
>> of 950MB/s and the previous FIO file gives 509MB/s. So it would seem the
>> increase in stripe_cache_size from 256 to 4096 gives an increase in your
>> FIO job from 950MB/s to 1634MB/s, which is a significant speed boost. I
> 
> 72 percent increase with this synthetic workload, by simply increasing
> the stripe cache.  Not bad eh?  This job doesn't present an accurate
> picture of real world performance though, as most synthetic tests don't.
> 
> Get DRBD a hump'n and your LVM snapshot(s) in place, all the normal
> server side load, then fire up the 32 thread FIO test on each TS VM to
> simulate users (I could probably knock out this job file if you like).
> Then monitor the array throughput with iostat or similar.  This would be
> about as close to peak real world load as you can get.

Interestingly I noted that fio can run in server/client mode, so in
theory I should be able to run a central job to instruct all the other
machines to start testing at the same time.... I'll work on this soon...
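
From a quick look at the docs it should be something like this (hostnames in
host.list are placeholders; I haven't actually tried it yet):

  # on each machine to be driven:
  fio --server

  # on the controlling box, host.list has one hostname/IP per line:
  fio --client=host.list my-job.fio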

>> must wonder why we have a default of 256 when this can make such a
>> significant performance improvement?  A value of 4096 with a 5 drive raid
>> array is only 80MB of cache, I suspect very few users with a 5 drive
>> RAID array would be concerned about losing 80MB of RAM, and a 2 drive
>> RAID array would only use 32MB ...
> 
> The stripe cache has nothing to do with device count, but hardware
> throughput.  Did you happen to notice what occurred when you increased
> cache size past your 4096 sweet spot to 32768?  Throughput dropped by
> ~500MB/s, almost 1/3rd.  Likewise, for the slow rust array whose sweet
> spot is 512, making the default 4096 will decrease his throughput, and
> eat 80MB RAM for nothing.  Defaults are chosen to work best with the
> lowest common denominator hardware, not the Ferrari.

Oh yeah, I forgot about HDD's :) However, I would have thought the cache
would be even more effective when the CPU/memory is so much faster than
the storage medium.... Oh well, that is somebody else's performance
testing/tuning job to work out, I've got enough on my plate right now :)


Thanks to the tip about running fio on windows, I think I've now come
full circle.... Today I had numerous complaints from users that their
outlook froze/etc, and in some cases the TS couldn't copy a file from
the DC to its local C: (iSCSI). The cause was the DC logging events
with event ID 2020, which is "The server was unable to allocate from the
system paged pool because the pool was empty". Supposedly the solution
to this is tuning two random numbers in the registry; not much is said
about what the consequences of this are, nor about how to calculate the
correct value. However, I think I've worked it out... first, let's look
at the fio results.

Running fio on one of the TS (win2003) against it's local C: (xen ->
iSCSI -> etc) gives this result:
> READ: io=16384MB, aggrb=239547KB/s, minb=239547KB/s, maxb=239547KB/s, mint=70037msec, maxt=0msec
> WRITE: io=16384MB, aggrb=53669KB/s, minb=53669KB/s, maxb=53669KB/s, mint=312601msec, maxt=0msec

To me, the read performance is as good as it can get (239MB/s looks like
2 x 1Gbps ethernet performance)...
The write performance might be a touch slow, but 53MB/s should be more
than enough to keep the users happy. I can come back to this later,
would be nice to see this closer to 200MB/s...

Running the same fio test on the same TS (win2003) against a SMB share
from the DC (SMB -> Win2000 -> Xen -> iSCSI -> etc)
> READ: io=16384MB, aggrb=14818KB/s, minb=14818KB/s, maxb=14818KB/s, mint=1132181msec, maxt=0msec
> WRITE: io=16384MB, aggrb=8039KB/s, minb=8039KB/s, maxb=8039KB/s, mint=2086815msec, maxt=0msec

This is pretty shockingly slow, and seems to clearly indicate why the
users are so upset... 14MB/s read and 8MB/s write, it's a wonder they
haven't formed a mob and lynched me yet!

However, the truly useful information is that during the read portion of
the test, the DC has a CPU load of 100% (no variation, just pegged at
100%); during the write portion it fluctuates between 80% and 100%.

This could also indicate why the pool was empty: if the CPU is so busy
that it doesn't have time to clean the pool, it runs out... One of the
registry entries is meant to start cleaning the pool sooner (default 80%,
suggested to reduce to 60% or even 40%).

So, I tried again to re-configure windows to support multiprocessor, but
that was another clear failure. (You can change the value/driver in
windows easily, but on reboot it fails to find the HDD, so BSoD or
usually just hangs). Supposedly this can be changed with an "install on
top", but I'll need to take a copy and test that out remotely.
Especially as this is the DC, I am not very comfortable with that.

The next option is to take another shot at upgrading to Win2003, which should
solve the multiprocessor issue, as well as provide much better support
for virtualisation. Though again, it's a major upgrade and could just
introduce a whole bunch of other problems....

Anyway, I've tried to tune a few basic things:
Remove some old devices from Device Manager on the DC
Uninstall some applications/drivers
Disable old unused services (backup software)
Extended the data drive from 279GB to 300GB (it was 90% full, now 84% full)
Adjusted registry entry to try and allocate additional memory to the pool
Increased xen memory allocation for the DC VM from 4096MB to 4200MB. I
suspect xen was keeping some of this memory for its own overhead, and I
want the VM to get a full 4GB.

I just need to restart the SAN to check it is picking up the right
settings on boot, and then put everything back online, and I'm done for
another night....

I'll come back to the benchmarking as soon as I get this DC CPU issue
resolved.

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-05 15:53                                       ` Adam Goryachev
@ 2013-03-07  7:36                                         ` Stan Hoeppner
  2013-03-08  0:17                                           ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-07  7:36 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Linux RAID

On 3/5/2013 9:53 AM, Adam Goryachev wrote:
> On 05/03/13 20:30, Stan Hoeppner wrote:

> BCP = Best Computing Practise ?

Typically "best current practice".

> Thanks to the tip about running fio on windows, I think I've now come
> full circle.... Today I had numerous complaints from users that their
> outlook froze/etc, and in some cases the TS couldn't copy a file from
> the DC to its local C: (iSCSI). The cause was the DC logging events
> with event ID 2020, which is "The server was unable to allocate from the
> system paged pool because the pool was empty". Supposedly the solution
> to this is tuning two random numbers in the registry; not much is said
> about what the consequences of this are, nor about how to calculate the
> correct value. 
...
> Running the same fio test on the same TS (win2003) against a SMB share
> from the DC (SMB -> Win2000 -> Xen -> iSCSI -> etc)
>> READ: io=16384MB, aggrb=14818KB/s, minb=14818KB/s, maxb=14818KB/s, mint=1132181msec, maxt=0msec
>> WRITE: io=16384MB, aggrb=8039KB/s, minb=8039KB/s, maxb=8039KB/s, mint=2086815msec, maxt=0msec

Run FIO on the DC itself and see what your NTFS throughput is to this
300GB filesystem.  Use a small file, say 2GB, since the FS is nearly
full.  Post results.

Fire up the Windows CLI FTP client in a TS session DOS box and do a GET
and PUT into this filesystem share on the DC.  This will tell us if the
TS to DC problem is TCP in general or limited to SMB.  Post transfer
rate results for GET and PUT.
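
Something like this from the DOS box, assuming an FTP service is running on
the DC (hostname and paths are examples); the Windows ftp client prints the
byte count and elapsed time/rate after each transfer:

  C:\> ftp dc1
  ftp> binary
  ftp> put c:\temp\bigfile.dat
  ftp> get bigfile.dat c:\temp\bigfile.copy
  ftp> bye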

> This is pretty shockingly slow, and seems to clearly indicate why the
> users are so upset... 14MB/s read and 8MB/s write, it's a wonder they
> haven't formed a mob and lynched me yet!

I've never used FIO on Windows against a Windows SMB share.  And
according to Google nobody else does.  So before we assume these numbers
paint an accurate picture of your DC SMB performance, and that the CPU
burn isn't due to an anomaly of FIO, you should run some simple Windows
file copy tests in Explorer and use Netmeter to measure the speed.  If
they're in the same ballpark then you know you can somewhat trust FIO
for SMB testing.  If they're wildly apart, probably not.

> However, the truly useful information is that during the read portion of
> the test, the DC has a CPU load of 100% (no variation, just pegged at
> 100%); during the write portion it fluctuates between 80% and 100%.

That 100% CPU is bothersome.  Turn off NTFS compression on any/all NTFS
volumes residing on SAN LUNs on the SSD array.  You're burning 100% DC
CPU at ~12MB/s data rate on the DC, so I can only assume it's turned on
for this 300GB volume.  These SSDs do on the fly compression, and very
quickly as you've seen.  Doing NTFS compression on top simply wastes cycles.

This should drop the CPU burn for new writes to the filesystem.  It
probably won't for reads, since NTFS must still decompress the existing
250GB+ of files.  If CPU drops considerably for writes but reads still
eat 100%, the only fix for this is to back up the filesystem, reformat
the device, then restore.  On the off chance that NTFS compression is
more efficient than the SandForce controller, you probably want to
increase the size of the volume before formatting.  And in fact,
Sysadmin 101 tells us never to run a production filesystem at more than
~70% capacity, so it would be smart to bump it up to 400GB to cover your
bases.

My second recommendation is to turn off the indexing service for all
these NTFS volumes, as this will conserve CPU cycles as well.
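
If I remember correctly the Indexing Service is "cisvc" on 2000/2003, so
roughly the following (verify the service name in the Services console
first; sc.exe may need the Resource Kit on Win2000), plus unticking the
per-volume "Allow Indexing Service to index this disk" box:

  net stop cisvc
  sc config cisvc start= disabled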

> Extended the data drive from 279GB to 300GB (it was 90% full, now 84% full)

Growing a filesystem in small chunks like this is a recipe for disaster.
 Your free space map is always heavily fragmented and is very large.
The more entries the filesystem driver must walk, the more CPU you burn.
 Recall we just discussed the table walking overhead of the md/RAID
stripe cache?  Filesystem maps/tables/B+ trees are much, much larger
structures.  When they don't fit in cache we read memory, and when they
don't fit in memory (remember your "pool" problem) we must read from disk.

If you've been expanding this NTFS this way for a while it would also
explain some of your CPU burn at the DC.  FYI, XFS is a MUCH higher
performance and much more efficient filesystem than NTFS ever dreamed of
becoming, but even XFS suffers slow IO and CPU burn due to heavily
fragmented free space.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-07  7:36                                         ` Stan Hoeppner
@ 2013-03-08  0:17                                           ` Adam Goryachev
  2013-03-08  4:02                                             ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Adam Goryachev @ 2013-03-08  0:17 UTC (permalink / raw)
  To: stan; +Cc: Linux RAID

On 07/03/13 18:36, Stan Hoeppner wrote:
> On 3/5/2013 9:53 AM, Adam Goryachev wrote:
>> On 05/03/13 20:30, Stan Hoeppner wrote:
>> Thanks to the tip about running fio on windows, I think I've now come
>> full circle.... Today I had numerous complaints from users that their
>> outlook froze/etc, and in some cases the TS couldn't copy a file from
>> the DC to its local C: (iSCSI). The cause was the DC logging events
>> with event ID 2020, which is "The server was unable to allocate from the
>> system paged pool because the pool was empty". Supposedly the solution
>> to this is tuning two random numbers in the registry; not much is said
>> about what the consequences of this are, nor about how to calculate the
>> correct value. 
> ...
>> Running the same fio test on the same TS (win2003) against a SMB share
>> from the DC (SMB -> Win2000 -> Xen -> iSCSI -> etc)
>>> READ: io=16384MB, aggrb=14818KB/s, minb=14818KB/s, maxb=14818KB/s, mint=1132181msec, maxt=0msec
>>> WRITE: io=16384MB, aggrb=8039KB/s, minb=8039KB/s, maxb=8039KB/s, mint=2086815msec, maxt=0msec
> 
> Run FIO on the DC itself and see what your NTFS throughput is to this
> 300GB filesystem.  Use a small file, say 2GB, since the FS is nearly
> full.  Post results.

Can't, I don't see a version of fio that runs on win2000...

> Fire up the Windows CLI FTP client in a TS session DOS box and do a GET
> and PUT into this filesystem share on the DC.  This will tell us if the
> TS to DC problem is TCP in general or limited to SMB.  Post transfer
> rate results for GET and PUT.

I had somewhat forgotten about FTP, and it does provide nice simple
performance results/numbers too. I'll give this a try; talking to
another linux box on the network, it should achieve 100MB/s (gigabit
speed) or close to it. I'll also run the same FTP test from one of the
2003 boxes for comparison.

>> This is pretty shockingly slow, and seems to clearly indicate why the
>> users are so upset... 14MB/s read and 8MB/s write, it's a wonder they
>> haven't formed a mob and lynched me yet!
> 
> I've never used FIO on Windows against a Windows SMB share.  And
> according to Google nobody else does.  So before we assume these numbers
> paint an accurate picture of your DC SMB performance, and that the CPU
> burn isn't due to an anomaly of FIO, you should run some simple Windows
> file copy tests in Explorer and use Netmeter to measure the speed.  If
> they're in the same ballpark then you know you can somewhat trust FIO
> for SMB testing.  If they're wildly apart, probably not.

Well, during the day, under normal user load, the CPU frequently rises
to around 70 to 80%; while this is not as clear-cut as 100%, it makes me
worry that it is limiting performance.

>> However, the truly useful information is that during the read portion of
>> the test, the DC has a CPU load of 100% (no variation, just pegged at
>> 100%); during the write portion it fluctuates between 80% and 100%.
> 
> That 100% CPU is bothersome.  Turn off NTFS compression on any/all NTFS
> volumes residing on SAN LUNs on the SSD array.  You're burning 100% DC
> CPU at ~12MB/s data rate on the DC, so I can only assume it's turned on
> for this 300GB volume.  These SSDs do on the fly compression, and very
> quickly as you've seen.  Doing NTFS compression on top simply wastes cycles.

NTFS compression is already disabled on all volumes.... I've *never*
enabled it on any system I've ever been responsible for, and never seen
anyone else do that. However, due to the age of this system, it is
possible that it has been enabled and then disabled again at some point.

> This should drop the CPU burn for new writes to the filesystem.  It
> probably won't for reads, since NTFS must still decompress the existing
> 250GB+ of files.  If CPU drops considerably for writes but reads still
> eat 100%, the only fix for this is to backup the filesystem, reformat
> the device, then restore.  

Since it was already disabled, I would expect that the majority of files
currently in use (especially the problematic outlook pst files) have
already been modified/decompressed anyway.

> On the off chance that NTFS compression is
> more efficient than the SandForce controller, you probably want to
> increase the size of the volume before formatting.  And in fact,
> Sysadmin 101 tells us never to run a production filesystem at more than
> ~70% capacity, so it would be smart to bump it up to 400GB to cover your
> bases.

I only bumped it up a small amount just in case I got burned by windows
2000 having an upper limit on disk size supported. I couldn't find a
clear answer on the maximum size supported.... I'll probably increase it
to at least 400GB or even 500GB as soon as I complete the upgrade to win2003.

> My second recommendation is to turn off the indexing service for all
> these NTFS volumes as well as this will conserve CPU cycle as well.

That is a good thought... I recently did do a complete file search on
the volume, and it seemed to need to traverse the directory tree anyway
(I was looking for files *.bak to delete old pst files).

>> Extended the data drive from 279GB to 300GB (it was 90% full, now 84% full)
> 
> Growing a filesystem in small chunks like this is a recipe for disaster.
>  Your free space map is always heavily fragmented and is very large.
> The more entries the filesystem driver must walk, the more CPU you burn.
>  Recall we just discussed the table walking overhead of the md/RAID
> stripe cache?  Filesystem maps/tables/B+ trees are much, much larger
> structures.  When they don't fit in cache we read memory, and when they
> don't fit in memory (remember your "pool" problem) we must read from disk.

Yes, I did try and run a defrag (win2000 version) on the volume. I was
fairly curious about whether this would have any advantage given the SSD
backed filesystem where random IO shouldn't matter. Though I did think
it might still offer some improvement if both free space and files were
contiguous. However, after running on two occasions for about 20 hours,
it added around 2 or 3 very narrow defragged sections with no real
progress. I think the defrag is either running very slowly as well,
and/or requires more free space to run more efficiently.

> If you've been expanding this NTFS this way for a while it would also
> explain some of your CPU burn at the DC.  FYI, XFS is a MUCH higher
> performance and much more efficient filesystem than NTFS ever dreamed of
> becoming, but even XFS suffers slow IO and CPU burn due to heavily
> fragmented free space.

Nope, it has had the same 300G HDD (279G) for at least 6 years (as far
as I know). I'm pretty sure it has only been extended by physical
replacement of the HDD from time to time.

The current plan is to upgrade to win2003, I'm hoping this will improve
performance equivalent to what is being achieved on the other 2003
servers, which should make the users happy again. I may increase the
disk space and have another crack at defrag prior to the upgrade since
the upgrade won't happen until next weekend at the earliest.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-08  0:17                                           ` Adam Goryachev
@ 2013-03-08  4:02                                             ` Stan Hoeppner
  2013-03-08  5:57                                               ` Mikael Abrahamsson
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-08  4:02 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Linux RAID

On 3/7/2013 6:17 PM, Adam Goryachev wrote:
> On 07/03/13 18:36, Stan Hoeppner wrote:

>> Run FIO on the DC itself and see what your NTFS throughput is to this
>> 300GB filesystem.  Use a small file, say 2GB, since the FS is nearly
>> full.  Post results.
> 
> Can't, I don't see a version of fio that runs on win2000...

Well, find another utility, or just do a file copy with a stopwatch.
The point here is to determine if NTFS is part of the bottleneck, as you
already verified, IIRC, that FIO at the hypervisor level is doing ~100MB/s.

> I had somewhat forgotten about FTP, and it does provide nice simple
> performance results/numbers too. I'll give this a try, talking to
> another linux box on the network, it should achieve 100MB/s (gigabit
> speeds) or close to it. I'll also run the same FTP test from one of the
> 2003 boxes for comparison.

I think you missed the point.  SMB with TS<->DC is ~10MB/s, but should
be more like 100MB/s.  Run the FTP client on TS against the FTP service
on the DC.  Get and put files from/to the 300GB NTFS volume that is
shared.  If FTP is significantly faster then we know SMB is the problem,
or something related to SMB, not TCP.

> Well, during the day, under normal user load, the CPU frequently rises
> to around 70 to 80%; while this is not as clear-cut as 100%, it makes me
> worry that it is limiting performance.

Agreed.  CPU burn shouldn't be that high for SMB serving.  The cause
could be any number of things, or a combination of things.  The steps
I'm suggesting will allow us to identify the cause of the burn and the
low SMB throughput.

> NTFS compression is already disabled on all volumes.... I've *never*
> enabled it on any system I've ever been responsible for, and never seen
> anyone else do that. However, due to the age of this system, it is
> possible that it has been enabled and then disabled again at some point.

Ok, good.  No compression.  How about NTFS encryption?

> I only bumped it up a small amount just in case I got burned by windows
> 2000 having an upper limit on disk size supported. I couldn't find a
> clear answer on the maximum size supported.... I'll probably increase it
> to at least 400 or even 500 as soon as I complete the upgrade to win2003.

2TB is the minimum ceiling.  If it's a Dynamic Disk the limit is 16TB:
http://technet.microsoft.com/en-us/library/cc779002%28v=ws.10%29.aspx

>> Growing a filesystem in small chunks like this is a recipe for disaster.
>>  Your free space map is always heavily fragmented and is very large.
>> The more entries the filesystem driver must walk, the more CPU you burn.
>>  Recall we just discussed the table walking overhead of the md/RAID
>> stripe cache?  Filesystem maps/tables/B+ trees are much, much larger
>> structures.  When they don't fit in cache we read memory, and when they
>> don't fit in memory (remember your "pool" problem) we must read from disk.
> 
> Yes, I did try and run a defrag (win2000 version) on the volume. I was

*NEVER* run a defragger against a filesystem residing on SSDs.  It will
shorten the life of the flash cells due to wear leveling, and thus the
SSDs themselves:  http://www.intel.com/support/ssdc/hpssd/sb/CS-029623.htm#5

And it will not fix the problem I was describing.

>> If you've been expanding this NTFS this way for a while it would also
>> explain some of your CPU burn at the DC.  FYI, XFS is a MUCH higher
>> performance and much more efficient filesystem than NTFS ever dreamed of
>> becoming, but even XFS suffers slow IO and CPU burn due to heavily
>> fragmented free space.
> 
> Nope, it has had the same 300G HDD (279G) for at least 6 years (as far
> as I know). I'm pretty sure is has only been extended by physical
> replacement of the HDD from time to time.

Ok. Good.  We're narrowing things down a bit.

> The current plan is to upgrade to win2003, I'm hoping this will improve
> performance equivalent to what is being achieved on the other 2003
> servers, which should make the users happy again. I may increase the
> disk space and have another crack at defrag prior to the upgrade since
> the upgrade won't happen until next weekend at the earliest.

Do NOT defrag.  This should be well driven home at this point.

When you upgrade to 2003, make sure you use the open source Xen
paravirtualized SCSI and NIC drivers:

http://jolokianetworks.com/70Knowledge/Virtualization/Open_Source_Windows_Paravirtualization_Drivers_for_Xen

-- 
Stan



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-08  4:02                                             ` Stan Hoeppner
@ 2013-03-08  5:57                                               ` Mikael Abrahamsson
  2013-03-08 10:09                                                 ` Stan Hoeppner
  0 siblings, 1 reply; 131+ messages in thread
From: Mikael Abrahamsson @ 2013-03-08  5:57 UTC (permalink / raw)
  To: Linux RAID

On Thu, 7 Mar 2013, Stan Hoeppner wrote:

> I think you missed the point.  SMB with TS<->DC is ~10MB/s, but should 
> be more like 100MB/s.  Run the FTP client on TS against the FTP service 
> on the DC.  Get and put files from/to the 300GB NTFS volume that is 
> shared.  If FTP is significantly faster then we know SMB is the problem, 
> or something related to SMB, not TCP.

Don't know if it's obvious to everybody, so if you already know the 
internals of SMB, you can stop reading:

Older versions of SMB use a 60 kilobyte block for transferring files. 
This works by requesting a block, waiting for that block to be delivered, 
then requesting the next one. Those who remember Xmodem will know what I'm 
talking about.

So if there is latency introduced somewhere, SMB performance deteriorates 
severely, to the point where if there is 30 ms delay, one can't really get 
more than 1 megabyte/s transfer speed, even if there is a 10GE pipe 
between the involved computers.

1 second / 30 milliseconds = ~33 round trips; 60 kilobytes per round trip
means ~1980 kilobytes/s maximum theoretical throughput for SMB on a 30 ms link.

Latencies can be introduced by trying to read from HDDs as well, so... this
might be worthwhile to look at.

I don't know exactly when the better versions of SMB/CIFS were introduced, 
but I believe it happened in Vista / Windows server 2008.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-08  5:57                                               ` Mikael Abrahamsson
@ 2013-03-08 10:09                                                 ` Stan Hoeppner
  2013-03-08 14:11                                                   ` Mikael Abrahamsson
  0 siblings, 1 reply; 131+ messages in thread
From: Stan Hoeppner @ 2013-03-08 10:09 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Linux RAID

On 3/7/2013 11:57 PM, Mikael Abrahamsson wrote:
> On Thu, 7 Mar 2013, Stan Hoeppner wrote:
> 
>> I think you missed the point.  SMB with TS<->DC is ~10MB/s, but should
>> be more like 100MB/s.  Run the FTP client on TS against the FTP
>> service on the DC.  Get and put files from/to the 300GB NTFS volume
>> that is shared.  If FTP is significantly faster then we know SMB is
>> the problem, or something related to SMB, not TCP.
> 
> Don't know if it's obvious to everybody, so if you already know the
> internals of SMB, you can stop reading:
> 
> Older versions of SMB use a 60 kilobyte block for transferring files.
> This works by requesting a block, waiting for that block to be
> delivered, then requesting the next one. Those who remember Xmodem will
> know what I'm talking about.

The default MaxTransmitBufferSize is actually quite low, 16644 bytes if
system RAM is >512MB, and 4356 bytes if RAM <512MB.  You can get it up
to 60KB read and 64KB write by modifying some other reg values.  This
applies all the way up to Server 2008.  But transmit buffer size isn't
the problem in this case.

> So if there is latency introduced somewhere, SMB performance
> deteriorates severely, to the point where if there is 30 ms delay, one
> can't really get more than 1 megabyte/s transfer speed, even if there is
> a 10GE pipe between the involved computers.

It's very unlikely that he's hitting latency over the wire.  GbE latency
is ~250uS (0.25ms).  We know from others' published experience that the
8111 series Realteks can be good for up to 90MB/s with fast CPUs, but
that others have trouble getting 25MB/s from them.  They could be part
of the problem here due to drivers, virtualization, etc.  I'm sure we'll
be looking at this.  Recall I recommended some time ago that Adam should
perform end-to-end netcat testing on all of his hosts' NIC ports to get
a baseline of TCP performance.
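
That test is something along the lines of the following (option syntax
differs between netcat variants; this is the traditional/Debian nc, and dd
reports the transfer rate when it finishes):

  # on the receiving host:
  nc -l -p 5001 > /dev/null

  # on the sending host, push 2GB of zeros through the link:
  dd if=/dev/zero bs=1M count=2048 | nc -q 1 <receiver-ip> 5001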

> Latencies can be introduced by trying to read from HDDs as well, so...
> this might be worthwhile to look at.

The problem could be one of any number of things, or a combination.
It's too early to tell without testing.  And the FTP test won't be the
last.  Hunting down Windows server performance problems is bad enough,
but once virtualized it gets worse.

> I don't know exactly when the better versions of SMB/CIFS were
> introduced, but I believe it happened in Vista / Windows server 2008.

Yes, SMB 2.0 was introduced with Vista and 2008.  It has higher
throughput over high latency links due to pipelining, but this doesn't
yield much on a LAN, even Fast Ethernet.  W2K/XP default SMB can hit the
25MB/s peak duplex data rate of 100FDX.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance -  5x SSD RAID5 - effects of stripe cache sizing
  2013-03-08 10:09                                                 ` Stan Hoeppner
@ 2013-03-08 14:11                                                   ` Mikael Abrahamsson
  0 siblings, 0 replies; 131+ messages in thread
From: Mikael Abrahamsson @ 2013-03-08 14:11 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Linux RAID

On Fri, 8 Mar 2013, Stan Hoeppner wrote:

> The default MaxTransmitBufferSize is actually quite low, 16644 bytes if 
> system RAM is >512MB, and 4356 bytes if RAM <512MB.  You can get it up 
> to 60KB read and 64KB write by modifying some other reg values.  This 
> applies all the way up to Server 2008.  But transmit buffers size isn't 
> the problem in this case.

Indeed, found this:

http://blogs.msdn.com/b/openspecification/archive/2009/04/10/smb-maximum-transmit-buffer-size-and-performance-tuning.aspx

It's not clear to me how the transmit buffers interact with reading from
the drive. If a 60 kilobyte read request comes in, 60 kilobytes (or
whatever) is read, sent out, wait; a new 60 kilobyte request comes in, needs
to be read from the drive, sent, wait. If automatic read-ahead isn't done
and the blocks aren't already cached, I can see this getting very
inefficient very quickly.

> Yes, SMB 2.0 was introduced with Vista and 2008.  It has higher 
> throughput over high latency links due to pipelining, but this doesn't 
> yield much on a LAN, even Fast Ethernet.  W2K/XP default SMB can hit the 
> 25MB/s peak duplex data rate of 100FDX.

Yes, if latency is low, this isn't a problem.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-03-01 16:10                 ` Adam Goryachev
@ 2013-03-10 15:35                   ` Charles Polisher
  2013-04-15 12:23                     ` Adam Goryachev
  0 siblings, 1 reply; 131+ messages in thread
From: Charles Polisher @ 2013-03-10 15:35 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: John Stoffel, Dave Cundiff, linux-raid

On Mar 02, 2013 Adam Goryachev wrote:
> On 24/02/13 02:57, John Stoffel wrote:
> > Can I please ask you to sit down and write a paper for USENIX on this
> > whole issue and how you resolved it?  You and Stan have done a great
> > job here documenting and discussing the problems, troubleshooting
> > methods and eventual solution(s) to the problem.  
> > 
> > It would be wonderful to have some diagrams to go with all this
> > discussion, showing the original network setup, iSCSI disk setup,
> > etc.  Then how to updated and changed thing to find bottlenecks. 
> > 
> > The interesting thing is the complete slowdown when using LVM
> > snapshots, which points to major possibilities for performance
> > improvements there.  But those improvements will be hard to do without
> > being able to run on real hardware, which is expensive for people to
> > have at home.  
> > 
> > I've been following this discussion from day one and really enjoying
> > it and I've learned quite a bit about iSCSI, networking and some of
> > the RAID issues.  I too run Debian stable on my home NFS/VM/mail/mysql
> > server and I've been getting frustrated by how far back it is, even
> > with backports.  I got burned in the past by testing, which is why I
> > stay on stable, but now I'm feeling like I'm getting burned on stable
> > too.  *grin*  It's a balancing act for sure!

Hi Adam, John, and Stan,

I too have been poring over this thread for weeks while building and
testing arrays in my lab, trying techniques you've been tossing
around, diagramming hardware & software, and generating plots of
the results. It's quite interesting work though friends are
asking pointed questions about where I've been. 

Last night's episode was tweaking the IO queue scheduler -- with
a raid0-on-raid5x2 I saw a 40% boost in IOPS for an 80/20 mix of
random read/write (noop vs cfq).
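
For anyone following along, the change and the load were roughly the
following; device and directory names are from my testbed:

  # switch a member disk to the noop elevator
  echo noop > /sys/block/sdb/queue/scheduler

  # 80/20 random read/write mix against a test file
  fio --name=randmix --directory=/mnt/test --size=8g --ioengine=libaio \
      --direct=1 --rw=randrw --rwmixread=80 --bs=4k --iodepth=32 \
      --numjobs=8 --runtime=60 --time_based --group_reporting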

> I've never written anything like that, but I think I could write a book
> on this. I keep thinking I should get a blog and put stuff like this on
> there, but there is always something else to do, and I'm not the sort of
> person to write in my diary every day :)
> 
> I've already written up a sort of non-technical summary for the client
> (about 5 pages), and just sent a non-detailed technical summary to the
> list. Once everything is completed and settled, I can try and combine
> those two, maybe throw in a bunch of extra details (command lines,
> config files, etc), and see where it ends up. I suppose you are
> volunteering as editor <G>

I can assist with testbeds, scripts, and visualizations that
support this process. I also have some editing skills. My
personal goal for this year (and maybe next) is to build an open
source tool that takes a system description, projects figures of
merit (price, performance, reliability) for specified workloads,
and scripts the setup, benching, data collection, and
visualization tasks. It seems there could be a lot of overlap
between my project and what is needed to put together an article.
Contact me if you'd like to explore working together.

Lastly, Adam: If MS Active Directory 2003 has any large group
objects (> 500 members), there can be large peaks in replication
traffic when group memberships change. There are other scenarios
for AD 2003 high-traffic issues. You could try using MS's
typeperf command line utility or their performance monitor GUI
to check the "DRA" inbound and outbound traffic during periods
of high disk/net activity. Also from experience you might check
if high CPU is related to anti-virus software that hasn't been
fenced out from checking the DIT.
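
From memory the counters of interest look something like the line below;
typeperf -q will list the available counter names so you can confirm the
exact spelling on your DC:

  typeperf "\NTDS\DRA Inbound Bytes Total/sec" "\NTDS\DRA Outbound Bytes Total/sec" -si 5 -sc 120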

Best regards,
-- 
Charles Polisher



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-03-10 15:35                   ` Charles Polisher
@ 2013-04-15 12:23                     ` Adam Goryachev
  2013-04-15 15:31                       ` John Stoffel
                                         ` (3 more replies)
  0 siblings, 4 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-04-15 12:23 UTC (permalink / raw)
  To: linux-raid

It's been quite a while, and I just wanted to post an update on the
current status of my problems.

As a quick refresh, the users were complaining of freezing, especially
when using outlook (pst file stored on file server), and sometimes
corrupted pst files or excel files with windows logging delayed write
failures.

Most users using MS Win 2003 Terminal Servers

File Server was MS Win 2000 Server

All servers are Virtual Machines running one VM per physical machine
under Xen (Debian Linux Testing)

All disk images are stored on the storage server (Debian Linux Stable,
upgraded to backports kernel).
Storage server config is:
VM sees normal HDD
Linux physical machine exports disk device
Linux physical machine imports iSCSI from Storage Server
Storage server exports iSCSI device
One Logical Volume for each VM
The Physical Volume is a DRBD
The DRBD is a RAID5 using MD
The MD is a 5 x 480GB Intel SSD
The SSD's are connected with a LSI SATA 3 controller

The storage server has a single bond0 with 8 x Gbps ethernet connections
for the iSCSI network

Each Physical machine has 2 x Gbps ethernet for iSCSI plus 1 Gbps for
the "user" network

Testing has shown that each VM can read/write at between 200 and 230MB/s
concurrently to the storage server (up to 4 at a time obviously).

So, finally, I've found that the issue is NOT RAID related, in fact, it
is not even disk/storage related! Certainly, there were one or more
problems causing slow performance of the storage backend, but I would
suggest that it was never the actual problem. (Even though fixing those
issues was definitely a plus in the long term).

After sitting on-site for a few days, I eventually noticed my terminal
server session (across the LAN) stopped responding; after ping testing,
I found the server went offline for around 10 seconds before coming back
and working normally (yes, it was a total accident that I discovered this). I added
a small script with fping to test all physical machine IP's and all VM
IP's every second for 60 seconds. Then, it will log the date/time the
test started, and each IP plus all 60 results for any IP that lost one
or more packets. (Reminder, this is over the LAN only, no WAN connections).

I found a "pattern" that showed one (at a time) random IP (VM or
physical, linux or windows), would stop responding to pings for between
10 and 50 seconds, then come back and work normally. These failures
would happen between zero and three times a day, generally occurring on
busy servers, either in the morning (users logging in) or afternoon
(users logging out).
In addition, random IP's drop a single ping packet around 40 or more
times per day, during business hours only.
There is never an outage of between two and 10 pings. There are lots of
single pings lost, and plenty between 10 and 50, but never any between 1
and 10. Sometimes (rarely) two or three in one minute, but not consecutive.

I suspect that the single ping packets being lost are an indication of a
problem, but this should not impact the users (TCP should look after the
re-transmission, etc). Whether this is related to the longer 10-50 second
outage I'm not sure.

I would expect that this network failure would explain all of the
user-reported symptoms:
1) Terminal server freezes up and the user needs to reboot the thin
   client to fix it (ie, wait a minute and reconnect to the session).
2) Windows delayed writes normally manage to succeed (probably thanks
   to TCP reliability features), but sometimes SMB/TCP times out, windows
   notices the network failure, and the write is failed, possibly
   corrupting the file being written to.

I've copied the testing script to a second machine, and the outages
(lasting more than a second) that each machine detects match (+/- a second).

All network cables were replaced with brand new cat6 cables (1m or 2M)
about 6 weeks ago.

The switch was a Netgear managed gigabit switch, but I replaced that
with a slightly older Netgear unmanaged gigabit switch with no change in
the results.

Overall network utilisation is minimal; the busiest server has an
average utilisation of 5Mbps during the day. Peak after hours traffic
(rsync backups over the LAN) will show sustained network utilisation of
around 80Mbps.

At this stage, I've moved totally away from suspecting a disk
performance or similar issue, and I don't think this can get any more
offtopic, but wanted to post a followup to my issue here. I still intend
to write something up to summarise the entire process once I eventually
get it resolved.

In the meantime, if anyone has any hints or suggestions on why a LAN
might be dropping packets like this, I'd be really happy to hear them,
because I'm scraping the bottom of the barrel. Currently I'm using
tcpdump to capture ALL network traffic to local disk on four machines,
hoping that the next drop will happen on one of them; then I can use
wireshark to see what happened during that time. If you've seen
anything similar, or have a suggestion (no matter how dumb), I'd be
happy to hear it.
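
In case it's useful to anyone, running the capture as a tcpdump ring
buffer keeps it from filling the local disks; roughly like this (the
interface name and output path are just examples):

# full packets, rotate at roughly 500MB, keep the 40 most recent files
tcpdump -n -i eth0 -s 0 -w /var/cap/lan.pcap -C 500 -W 40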

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-15 12:23                     ` Adam Goryachev
@ 2013-04-15 15:31                       ` John Stoffel
  2013-04-17 10:15                         ` Adam Goryachev
  2013-04-15 16:49                       ` Roy Sigurd Karlsbakk
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 131+ messages in thread
From: John Stoffel @ 2013-04-15 15:31 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid


Hi Adam,

Thanks for all your posts and continued updates on your travails in
getting performance out of your Xen systems and storage.  A complete
eye-opener in a lot of ways.

Now, from looking over your report, it strongly smells of a problem in
the network switch.  I think you have just one in the core of your
network, correct?  I'd probably try to bring up a test network (if you
have the spare systems in a lab) and try to replicate the packet
drops.

But in general, I'd probably:

  - remove the iSCSI bonding, go to a single 1Gb link.
  - get rid of jumbo frames, if you're using them.
  - can you reduce the size of your bond0 on the storage box?  

I wonder if the switch is having some sort of table overflow, or is
just having some sort of brain fart, dropping a packet and then needing
time to rebuild its tables internally before things get going again?

I'd try to borrow a similarly sized switch from another vendor and try
using that instead if you can.  Another thing is to use SNMP to grab
stats from the switch and look for patterns.  When you see
connectivity problems, do you see a corresponding drop on one of the
links in the bond0 connection?  Or on another bond?
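
Even a quick poll of the switch's error/discard counters over SNMP can
be revealing, e.g. something like this (the community string and switch
address are placeholders, and it assumes SNMP is enabled on the managed
switch):

snmpwalk -v2c -c public 10.2.2.250 IF-MIB::ifInErrors
snmpwalk -v2c -c public 10.2.2.250 IF-MIB::ifInDiscards
snmpwalk -v2c -c public 10.2.2.250 IF-MIB::ifOutDiscards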

But, thinking about it more, you don't mention whether you're dropping
packets on the iSCSI side of things, or just on the regular network.
That's a key observation, since it will either support or refute my
idea of the problem being in the bond(s).

Do you see any errors in the dmesg logs on the Xen/Linux/Windows
boxes?  And when you have an outage between two hosts, do pings to
*other* hosts still work just fine, or does all network traffic on
that host come to a stop?

It really smells of a switch problem.  Have you checked that the
switch firmware is up to date?  It might just be that Netgear makes a
crappy switch (cue people to chime in on this! :-) which can't handle
the load you're tossing at it.  Which is why I suggest you try another
vendor's switch.

Cisco is probably reliable but expensive.  Dell has some ok switches
in my experience, but nothing recent.  I've heard good things about
other brands such as Juniper, Force10 (now Dell) and others.  

Please keep posting, it's great information for the rest of us to keep
in the back of our heads.

John

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-15 12:23                     ` Adam Goryachev
  2013-04-15 15:31                       ` John Stoffel
@ 2013-04-15 16:49                       ` Roy Sigurd Karlsbakk
  2013-04-15 20:16                       ` Phil Turmel
  2013-04-15 20:42                       ` Stan Hoeppner
  3 siblings, 0 replies; 131+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-04-15 16:49 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

> The storage server has a single bond0 with 8 x Gbps ethernet
> connections for the iSCSI network

What sort of bonding? 802.3ad or something else? If the former, it probably won't work on the unmanaged switch. If the latter, please give details.
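
The kernel reports the active mode per bond, so something like this should settle it (bond name as on the storage box):

cat /proc/net/bonding/bond0
# the "Bonding Mode:" line near the top shows 802.3ad, balance-alb, etc.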

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for every pedagogue to avoid excessive use of idioms of xenotypic etymology. In most cases, adequate and relevant synonyms exist in Norwegian.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-15 12:23                     ` Adam Goryachev
  2013-04-15 15:31                       ` John Stoffel
  2013-04-15 16:49                       ` Roy Sigurd Karlsbakk
@ 2013-04-15 20:16                       ` Phil Turmel
  2013-04-16 19:28                         ` Roy Sigurd Karlsbakk
  2013-04-15 20:42                       ` Stan Hoeppner
  3 siblings, 1 reply; 131+ messages in thread
From: Phil Turmel @ 2013-04-15 20:16 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On 04/15/2013 08:23 AM, Adam Goryachev wrote:
> It's been quite a while, and I just wanted to post an update on the
> current status of my problems.

Thanks for updating us.

> As a quick refresh, the users were complaining of freezing, especially
> when using outlook (pst file stored on file server), and sometimes
> corrupted pst files or excel files with windows logging delayed write
> failures.

[trim /]

> After sitting on-site for a few days, I eventually noticed my terminal
> server session (across the LAN) stopped responding, after ping testing,
> I found the server went offline for around 10 seconds before coming back
> and working normally (yes, a total accident I discovered this). I added
> a small script with fping to test all physical machine IP's and all VM
> IP's every second for 60 seconds. Then, it will log the date/time the
> test started, and each IP plus all 60 results for any IP that lost one
> or more packets. (Reminder, this is over the LAN only, no WAN connections).
> 
> I found a "pattern" that showed one (at a time) random IP (VM or
> physical, linux or windows), would stop responding to pings for between
> 10 and 50 seconds, then come back and work normally. These failures
> would happen between zero and three times a day, generally occurring on
> busy servers, either in the morning (users logging in) or afternoon
> (users logging out).
> In addition, random IP's drop a single ping packet around 40 or more
> times per day, during business hours only.
> There is never an outage of between two and 10 pings. There are lots of
> single pings lost, and plenty between 10 and 50, but never any between 1
> and 10. Sometimes (rarely) two or three in one minute, but not consecutive.
> 
> I suspect that the single ping packets being lost are an indication of a
> problem, but this should not impact the users (TCP should look after the
> re-transmission, etc). Wether this is related to the longer 10-50 second
> outage I'm not sure.

No, single lost pings are *not* a sign of a problem.  It is perfectly
normal for a network to have random traffic spikes that fill a switch's
store-and-forward buffers.  ICMP pings are *datagrams*, like UDP, so
they aren't retransmitted when dropped.  Losing them as infrequently as
you say suggests your network isn't heavily loaded.

(Smart switches will attempt to notify hosts of buffer-full conditions,
but that just means the datagram is dropped in the host's IP stack
instead of on the wire.)

Losing multiple pings as you describe, with matching freezes on UIs,
does sound like a serious problem.

[trim /]

> At this stage, I've moved totally away from suspecting a disk
> performance or similar issue, and I don't think this can get any more
> offtopic, but wanted to post a followup to my issue here. I still intend
> to write something up to summarise the entire process once I eventually
> get it resolved.
> 
> In the meantime, if anyone has any hints or suggestions on why a LAN
> might be dropping packets like this, I'd be really happy to hear it,
> because I'm scraping the bottom. Currently I'm using tcpdump to capture
> ALL network traffic to local disk on 4 machines, and hoping that network
> drop will happen on one of these 4. Then I can use wireshark to see what
> happened during that time. If you've seen anything similar, got a random
> suggestion (no matter how dumb) I'd be happy to hear it please.

Don't forget to put performance/latency monitors in your hosts...  There
might be a hardware issue in a critical node that is triggering this.
It might be visible on your four wireshark machines when they suddenly
fail to record many packets.  In other words, if one machine sees a gap
in traffic while the other machines transmit many retries, that
suggests the first machine has an internal problem.
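
Even something as simple as sysstat/vmstat polling on each host would
help correlate (assuming the sysstat package is installed; adjust the
intervals to taste):

sar -n EDEV 1 60    # per-interface rx/tx errors and drops, once a second
vmstat 1 60         # run queue, memory pressure, context switches
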
> 
> Regards,
> Adam

HTH,

Phil

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-15 12:23                     ` Adam Goryachev
                                         ` (2 preceding siblings ...)
  2013-04-15 20:16                       ` Phil Turmel
@ 2013-04-15 20:42                       ` Stan Hoeppner
  3 siblings, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-04-15 20:42 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

On 4/15/2013 7:23 AM, Adam Goryachev wrote:
...
> I suspect that the single ping packets being lost are an indication of a
> problem, but this should not impact the users (TCP should look after the
> re-transmission, etc). Wether this is related to the longer 10-50 second
> outage I'm not sure.
...
> All network cables were replaced with brand new cat6 cables (1m or 2M)
> about 6 weeks ago.

Server rack cables, but not *all* network cables.

> In the meantime, if anyone has any hints or suggestions on why a LAN
> might be dropping packets like this, I'd be really happy to hear it...

Sounds like a classic ground short or EMI issue.

Inexpensive switches tend to lack per port electrical isolation
circuitry.  Thus if there is a short across any of the 4 cable
conductors the switch may lock up while the condition exists and return
to normal operation when the condition is removed, possibly without
manual intervention required, no reboot or power cycle.  The normal
cause of this is a defective Ethernet transceiver in one of the devices
connected to the switch, or a break in a cable.  Depending on the
transceiver failure mode this can cause intermittent or permanent switch
lockup while the device is connected.  In this case it would seem to be
intermittent, if this is the cause of your problems.  This phenomenon is
limited to copper links.  Fiber doesn't conduct electricity.

Category 5/6 cable is pretty good at rejecting EMI/RFI.  But if the
cable is run in close proximity to anything generating a significant
stray magnetic field of the necessary amplitude, this may cause
electromagnetic induction in the cables, generating a current.  This
current can cause problems in the switch similar to that described above
if the switch isn't designed to deal with such conditions.

-- 
Stan


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-15 20:16                       ` Phil Turmel
@ 2013-04-16 19:28                         ` Roy Sigurd Karlsbakk
  2013-04-16 21:03                           ` Phil Turmel
  2013-04-16 21:43                           ` Stan Hoeppner
  0 siblings, 2 replies; 131+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-04-16 19:28 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid, Adam Goryachev

> > I suspect that the single ping packets being lost are an indication
> > of a
> > problem, but this should not impact the users (TCP should look after
> > the
> > re-transmission, etc). Wether this is related to the longer 10-50
> > second
> > outage I'm not sure.
> 
> No, single lost pings are *not* a sign of a problem. It is perfectly
> normal for a network to have random traffic spikes that fill a
> switch's
> store-and-forward buffers. ICMP pings are *datagrams*, like UDP, so
> they aren't retransmitted when dropped. Losing them as infrequently as
> you say suggests your network isn't heavily loaded.

Switches (unlike bridges) do not use store-and-forward. They use cut-through, meaning they use store-and-forward for the initial packet from A to B, then store the path and switch later packets directly, sniffing the MAC addresses and just passing them through.

As was said, the traffic on the network was minimal, so I really doubt this had an impact. Getting 30 seconds+ of drops must come from a bad network stack or a really bad switch, but then again, two switches were tested, so I doubt the switches alone could do that.

What may be doing it, is bad (or perhaps incompatible) bonding setup.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for every pedagogue to avoid excessive use of idioms of xenotypic etymology. In most cases, adequate and relevant synonyms exist in Norwegian.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-16 19:28                         ` Roy Sigurd Karlsbakk
@ 2013-04-16 21:03                           ` Phil Turmel
  2013-04-16 21:43                           ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Phil Turmel @ 2013-04-16 21:03 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: linux-raid, Adam Goryachev

On 04/16/2013 03:28 PM, Roy Sigurd Karlsbakk wrote:
>>> I suspect that the single ping packets being lost are an
>>> indication of a problem, but this should not impact the users
>>> (TCP should look after the re-transmission, etc). Wether this is
>>> related to the longer 10-50 second outage I'm not sure.
>> 
>> No, single lost pings are *not* a sign of a problem. It is
>> perfectly normal for a network to have random traffic spikes that
>> fill a switch's store-and-forward buffers. ICMP pings are
>> *datagrams*, like UDP, so they aren't retransmitted when dropped.
>> Losing them as infrequently as you say suggests your network isn't
>> heavily loaded.
> 
> Switches (unlike bridges) do not use store-and-forward. They use
> cut-through, meaning they use store-and-forward for the initial
> packet from A to B and then store the path and switch it later,
> sniffing the MAC addresses and just use pass-through.

Nothing you said changes my statement that switches often drop single
packets.  The occasional dropped ping is a red herring.  A cheap switch
that can't ever buffer will simply drop *more* random packets.

> As was said, the traffic on the network was minimal, so I really
> doubt this had an impact. Getting 30 seconds+ of drops must come from
> a bad network stack or a really bad switch, but then again, two
> switches were tested, so I doubt the switches alone could do that.

We seem to violently agree here.  Multiple consecutive drops is a real
problem.

> What may be doing it, is bad (or perhaps incompatible) bonding
> setup.

My point was to not prematurely conclude that the problem is in the network.

Phil

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-16 19:28                         ` Roy Sigurd Karlsbakk
  2013-04-16 21:03                           ` Phil Turmel
@ 2013-04-16 21:43                           ` Stan Hoeppner
  1 sibling, 0 replies; 131+ messages in thread
From: Stan Hoeppner @ 2013-04-16 21:43 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Phil Turmel, linux-raid, Adam Goryachev

On 4/16/2013 2:28 PM, Roy Sigurd Karlsbakk wrote:

> Switches (unlike bridges) do not use store-and-forward. They use cut-through...

This is incorrect Roy.  The default forwarding mode of all IEEE 802.3
compliant switches is store and forward.  All unmanaged switches use
this forwarding mode.  Cut-through is an optimization and is optional
only on managed switches.

And these aren't the only two forwarding modes in managed switches
today.  Switch routers do layer 3 as well as 2 forwarding, and switches
offering VLANs, QOS, and other features use different forwarding methods
still.  But TTBOMK, all vendors' managed switches default to store and
forward, as that is the IEEE 802.3 standard.

-- 
Stan




^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: RAID performance - new kernel results
  2013-04-15 15:31                       ` John Stoffel
@ 2013-04-17 10:15                         ` Adam Goryachev
  0 siblings, 0 replies; 131+ messages in thread
From: Adam Goryachev @ 2013-04-17 10:15 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid

I'm amalgamating a few different replies into a single post here to
reduce the noise on-list...

On 16/04/13 01:31, John Stoffel wrote:
> Now, from looking over your report, it strongly smells of a problem in
> the network switch.  I think you have just one in the core of your
> network, correct?  I'd probably try to bring up a test network (if you
> have the spare systems in a lab) and try to replicate the packet
> drops.

Unfortunately, definitely not... Besides, the main issue would be
generating sufficient load to make it happen. I'm fairly sure it is
load related, since it only happens during the day, and generally at
times when a lot of users are logging in or logging out.

However, there are about 5 switches in the network in total.
Switch 1 is the "core"; it connects directly to switches 2, 3 and 4 (or
5; switch 4 is the unmanaged switch, switch 5 is the managed one,
whichever is currently in place).

Switches 1, 2 and 3 connect to various
workstations/printers/devices/routers/etc.

Switch 4/5 connects to ALL the servers that are under
analysis/discussion here.

Then there is switch 6, which is a 48 port managed switch; it connects
the 8 ports from san1 (storage server 1), 8 ports from san2, and 2
ports from each of the 8 machines (32 ports used in total). There is
no connection from this switch to any other switch; it is a separate,
isolated subnet just for iSCSI.

> But in general, I'd probably:
>   - remove the iSCSI bonding, goto a single 1Gb link.

Not relevant, iSCSI is on a different network

>   - get rid of jumbo frames, if you're using it.

No jumbo frames on this network; jumbo frames are used on the iSCSI
network only.

>   - can you reduce the size of your bond0 on the storage box?  

There are no bonded ethernet interfaces on the network with the issue.

> I wonder if the switch is having some sort of table over-flow, or is
> just having some sort of brain fart and droppping a packet and then
> needs time to rebuild it's tables internally to get things going
> again?

Quite possible, although when the smart switch was in use it reported
a maximum-ever MAC address table size of 64 entries. I didn't bother
looking up the specs, but I'm sure switches in this class usually
support at least 1000 entries... The current number of learned entries
is 45.

> I'd try to borrow a similar sized switch from another vendor and try
> using that instead if you can.  Another thing is to try and use SNMP
> to grab stats from the switch and look for patterns.  When you see
> connectivity problems, do you see a corresponding drop on one of the
> links on the bond0 connection?  Or on another bond?  

There are definitely no link drops (on any of the networks): Linux
never logs a link drop on any ethernet interface, and I presume Windows
would also log that as an event, but in any case none of the affected
Linux machines has recorded one.

> But, thinking about it more, you don't mention if you're dropping
> packets on the iSCSI side of things, or just on the regular network.
> That's a key observation, since it will either suggest, or refute my
> idea of the problem being in the bond(s).  

Right, packet loss is only happening on the regular network. I've not
done any ping tests on the iSCSI network, but everything seems to be
working and performing perfectly in every test I do, and there are no
complaints whose cause could be blamed on a disk/iSCSI issue. So I
don't suspect any issue on the iSCSI network at this stage.

> Do you see any errors in the dmesg logs on the Xen/Linux/Windows
> boxes?  And when you have an outage between two hosts, do pings to
> *other* hosts still work just fine, or does all network traffic on
> that host come to a stop?

It is interesting...
1) pings from host1 to host2 shows no replies for a period of 10 seconds.
2) pings from host3 to host2 shows no replies for the same period
3) traffic sniffing (with tcpdump) and analysis (with wireshark) on the
physical ethernet interface of a machine running a Windows VM shows a
*lot* of TCP retransmissions (see the filter example after this list),
and some of the ICMP requests are seen (but not all, and obviously most
traffic quickly dies off because no ACKs are being sent out). In
addition, a very small number of outbound packets can be seen,
including some ICMP replies, but the remote party clearly never
received them.
4) During the same period, every ICMP request/reply to the physical
machine is successful.
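
(The retransmissions mentioned in point 3 are easy to pull out of a
capture with a display filter, something like the line below; the
capture file name is just an example, and depending on the tshark
version the filter option is -R or -Y.)

tshark -r lan-capture.pcap -R "tcp.analysis.retransmission || icmp"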

> It really smells of a switch problem.  Have you checked that the
> switch firmware is upto date?

Yes, switch firmware is up to date

> It might just be that Netgear makes a
> crappy switch (cue people to chime on on this! :-) which can't handle
> the load you're tossing at it.  Which is why I suggest you try another
> vendor's switch.

I wouldn't suggest Netgear make the best equipment ever, but I've used
their switches for many years and in many customers' networks, and have
yet to have a real problem. Certainly, a couple have eventually died,
but I'd never really had a problem with one before now. Also, I have of
course already tried two different models here.

> Cisco is probably reliable but expensive.  Dell has some ok switches
> in my experience, but nothing recent.  I've heard good things about
> other brands such as Juniper, Force10 (now Dell) and others.  

I would certainly be happy to buy another switch of any brand, even
Cisco, if it would solve the problem. The issue is that the chance of
it solving the problem seems so small that it would likely be a waste
of money better spent elsewhere.

On 16/04/13 02:35, Romain Francoise wrote:
> Total shot in the dark, but maybe you're seeing the effect of an
> interface changing its MAC address. This typically happens with
> bridge interfaces, which can change their MAC when you add a new
> member, the highest-numbered address gets used automatically.
> 
> When the MAC address changes, all the other hosts have to do a new
> ARP resolution to update their tables, which causes a few seconds of
> delay.

Using the above tcpdump/wireshark captures, I can see the ARP requests
before and after the outage, and the MAC address matches. In addition,
there are generally no network changes, no machines being rebooted,
etc.; it's a pretty stable network overall. (This excludes the various
workstations etc. that are on the same network/broadcast domain; they
are regularly rebooted by users as needed.)

On 16/04/13 02:49, Roy Sigurd Karlsbakk wrote:
> What sort of bonding? 802.3ad or something else? if using the
> former, this probably won't work on the non-managed switch. if using
> the latter, then please detail

I'm using bond-mode balance-alb, but as mentioned, that is on the iSCSI
network, so it has no impact/relevance here (well, it should not).

On 17/04/13 07:03, Phil Turmel wrote:
> On 04/16/2013 03:28 PM, Roy Sigurd Karlsbakk wrote:
>>>> I suspect that the single ping packets being lost are an
>>>> indication of a problem, but this should not impact the users
>>>> (TCP should look after the re-transmission, etc). Wether this is
>>>> related to the longer 10-50 second outage I'm not sure.
>>>
>>> No, single lost pings are *not* a sign of a problem. It is
>>> perfectly normal for a network to have random traffic spikes that
>>> fill a switch's store-and-forward buffers. ICMP pings are
>>> *datagrams*, like UDP, so they aren't retransmitted when dropped.
>>> Losing them as infrequently as you say suggests your network isn't
>>> heavily loaded.
>>
>> Switches (unlike bridges) do not use store-and-forward. They use
>> cut-through, meaning they use store-and-forward for the initial
>> packet from A to B and then store the path and switch it later,
>> sniffing the MAC addresses and just use pass-through.
>
> Nothing you said changes my statement that switches often drop single
> packets.  The occasional dropped ping is a red herring.  A cheap
> switch that can't ever buffer will simply drop *more* random packets.

However, if the switch dropped the packet, it should be counted, yet
the switch is reporting 0 dropped packets across every port (which is
what I would expect; this isn't, or at least shouldn't be, a very busy
network).

>> As was said, the traffic on the network was minimal, so I really
>> doubt this had an impact. Getting 30 seconds+ of drops must come from
>> a bad network stack or a really bad switch, but then again, two
>> switches were tested, so I doubt the switches alone could do that.
> We seem to violently agree here.  Multiple consecutive drops is a real
> problem.

Right, and this is really the only reason I'm even noticing the
occasional single packet being dropped.....

>> What may be doing it, is bad (or perhaps incompatible) bonding
>> setup.
> My point was to not prematurely conclude that the problem is in the
> network.

It can't (I don't think) be bonding, since the bonded interfaces are on
the other network. The network where I'm seeing the packet loss doesn't
have any machine with more than a single 1Gbps ethernet interface
connected to its switch.

Given the number of switches on the network, and even though the packet
loss is mostly happening on only one of those switches, could it be an
STP misconfiguration of some sort?

The topology is reasonably flat:
Unmanaged Switch = US
Managed Switch = MS
Linux Bridge = LB

     US1
   /  |  \
 US2 US3 MS5
        / |  \
      LB1 LB2 LB3

Note: The Linux Bridge is configured in debian /etc/network/interfaces
auto xenbr0
iface xenbr0 inet static
	address 10.2.2.3
	netmask 255.255.240.0
	gateway 10.2.2.254
	bridge_maxwait 5
	bridge_ports regex eth0

When the VM is created, an additional interface is created and added to
the bridge, but this is not done during the day (or even very often at
night).....
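
One thing I still need to double-check is whether the Linux bridges
themselves are participating in STP at all; brctl should show it
(assuming bridge-utils is installed):

brctl show            # lists bridges and whether STP is enabled
brctl showstp xenbr0  # per-port STP state and topology change info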

The managed switch has the following config:
Spanning Tree State: Disable
STP Operation Mode: STP RSTP MSTP (Selected option is MSTP)
Configuration Name:
Configuration Revision Level: 0 (Valid values 0 - 65535)
Configuration Digest Key:
BPDU Flooding: All (or specific port number) Disable/Enable (Selected
option is Disable)

The "System Log" on the switch is full of useless "SNTP system clock
synchronized" messages every 10 minutes or so.



Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


^ permalink raw reply	[flat|nested] 131+ messages in thread

end of thread, other threads:[~2013-04-17 10:15 UTC | newest]

Thread overview: 131+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-07  6:48 RAID performance Adam Goryachev
2013-02-07  6:51 ` Adam Goryachev
2013-02-07  8:24   ` Stan Hoeppner
2013-02-07  7:02 ` Carsten Aulbert
2013-02-07 10:12   ` Adam Goryachev
2013-02-07 10:29     ` Carsten Aulbert
2013-02-07 10:41       ` Adam Goryachev
2013-02-07  8:11 ` Stan Hoeppner
2013-02-07 10:05   ` Adam Goryachev
2013-02-16  4:33     ` RAID performance - *Slow SSDs likely solved* Stan Hoeppner
     [not found]       ` <cfefe7a6-a13f-413c-9e3d-e061c68dc01b@email.android.com>
2013-02-17  5:01         ` Stan Hoeppner
2013-02-08  7:21   ` RAID performance Adam Goryachev
2013-02-08  7:37     ` Chris Murphy
2013-02-08 13:04     ` Stan Hoeppner
2013-02-07  9:07 ` Dave Cundiff
2013-02-07 10:19   ` Adam Goryachev
2013-02-07 11:07     ` Dave Cundiff
2013-02-07 12:49       ` Adam Goryachev
2013-02-07 12:53         ` Phil Turmel
2013-02-07 12:58           ` Adam Goryachev
2013-02-07 13:03             ` Phil Turmel
2013-02-07 13:08               ` Adam Goryachev
2013-02-07 13:20                 ` Mikael Abrahamsson
2013-02-07 22:03               ` Chris Murphy
2013-02-07 23:48                 ` Chris Murphy
2013-02-08  0:02                   ` Chris Murphy
2013-02-08  6:25                     ` Adam Goryachev
2013-02-08  7:35                       ` Chris Murphy
2013-02-08  8:34                         ` Chris Murphy
2013-02-08 14:31                           ` Adam Goryachev
2013-02-08 14:19                         ` Adam Goryachev
2013-02-08  6:15                   ` Adam Goryachev
2013-02-07 15:32         ` Dave Cundiff
2013-02-08 13:58           ` Adam Goryachev
2013-02-08 21:42             ` Stan Hoeppner
2013-02-14 22:42               ` Chris Murphy
2013-02-15  1:10                 ` Adam Goryachev
2013-02-15  1:40                   ` Chris Murphy
2013-02-15  4:01                     ` Adam Goryachev
2013-02-15  5:14                       ` Chris Murphy
2013-02-15 11:10                         ` Adam Goryachev
2013-02-15 23:01                           ` Chris Murphy
2013-02-17  9:52             ` RAID performance - new kernel results Adam Goryachev
2013-02-18 13:20               ` RAID performance - new kernel results - 5x SSD RAID5 Stan Hoeppner
2013-02-20 17:10                 ` Adam Goryachev
2013-02-21  6:04                   ` Stan Hoeppner
2013-02-21  6:40                     ` Adam Goryachev
2013-02-21  8:47                       ` Joseph Glanville
2013-02-22  8:10                       ` Stan Hoeppner
2013-02-24 20:36                         ` Stan Hoeppner
2013-03-01 16:06                           ` Adam Goryachev
2013-03-02  9:15                             ` Stan Hoeppner
2013-03-02 17:07                               ` Phil Turmel
2013-03-02 23:48                                 ` Stan Hoeppner
2013-03-03  2:35                                   ` Phil Turmel
2013-03-03 15:19                                 ` Adam Goryachev
2013-03-04  1:31                                   ` Phil Turmel
2013-03-04  9:39                                     ` Adam Goryachev
2013-03-04 12:41                                       ` Phil Turmel
2013-03-04 12:42                                       ` Stan Hoeppner
2013-03-04  5:25                                   ` Stan Hoeppner
2013-03-03 17:32                               ` Adam Goryachev
2013-03-04 12:20                                 ` Stan Hoeppner
2013-03-04 16:26                                   ` Adam Goryachev
2013-03-05  9:30                                     ` RAID performance - 5x SSD RAID5 - effects of stripe cache sizing Stan Hoeppner
2013-03-05 15:53                                       ` Adam Goryachev
2013-03-07  7:36                                         ` Stan Hoeppner
2013-03-08  0:17                                           ` Adam Goryachev
2013-03-08  4:02                                             ` Stan Hoeppner
2013-03-08  5:57                                               ` Mikael Abrahamsson
2013-03-08 10:09                                                 ` Stan Hoeppner
2013-03-08 14:11                                                   ` Mikael Abrahamsson
2013-02-21 17:41                     ` RAID performance - new kernel results - 5x SSD RAID5 David Brown
2013-02-23  6:41                       ` Stan Hoeppner
2013-02-23 15:57               ` RAID performance - new kernel results John Stoffel
2013-03-01 16:10                 ` Adam Goryachev
2013-03-10 15:35                   ` Charles Polisher
2013-04-15 12:23                     ` Adam Goryachev
2013-04-15 15:31                       ` John Stoffel
2013-04-17 10:15                         ` Adam Goryachev
2013-04-15 16:49                       ` Roy Sigurd Karlsbakk
2013-04-15 20:16                       ` Phil Turmel
2013-04-16 19:28                         ` Roy Sigurd Karlsbakk
2013-04-16 21:03                           ` Phil Turmel
2013-04-16 21:43                           ` Stan Hoeppner
2013-04-15 20:42                       ` Stan Hoeppner
2013-02-08  3:32       ` RAID performance Stan Hoeppner
2013-02-08  7:11         ` Adam Goryachev
2013-02-08 17:10           ` Stan Hoeppner
2013-02-08 18:44             ` Adam Goryachev
2013-02-09  4:09               ` Stan Hoeppner
2013-02-10  4:40                 ` Adam Goryachev
2013-02-10 13:22                   ` Stan Hoeppner
2013-02-10 16:16                     ` Adam Goryachev
2013-02-10 17:19                       ` Mikael Abrahamsson
2013-02-10 21:57                         ` Adam Goryachev
2013-02-11  3:41                           ` Adam Goryachev
2013-02-11  4:33                           ` Mikael Abrahamsson
2013-02-12  2:46                       ` Stan Hoeppner
2013-02-12  5:33                         ` Adam Goryachev
2013-02-13  7:56                           ` Stan Hoeppner
2013-02-13 13:48                             ` Phil Turmel
2013-02-13 16:17                             ` Adam Goryachev
2013-02-13 20:20                               ` Adam Goryachev
2013-02-14 12:22                                 ` Stan Hoeppner
2013-02-15 13:31                                   ` Stan Hoeppner
2013-02-15 14:32                                     ` Adam Goryachev
2013-02-16  1:07                                       ` Stan Hoeppner
2013-02-16 17:19                                         ` Adam Goryachev
2013-02-17  1:42                                           ` Stan Hoeppner
2013-02-17  5:02                                             ` Adam Goryachev
2013-02-17  6:28                                               ` Stan Hoeppner
2013-02-17  8:41                                                 ` Adam Goryachev
2013-02-17 13:58                                                   ` Stan Hoeppner
2013-02-17 14:46                                                     ` Adam Goryachev
2013-02-19  8:17                                                       ` Stan Hoeppner
2013-02-20 16:45                                                         ` Adam Goryachev
2013-02-21  0:45                                                           ` Stan Hoeppner
2013-02-21  3:10                                                             ` Adam Goryachev
2013-02-22 11:19                                                               ` Stan Hoeppner
2013-02-22 15:25                                                                 ` Charles Polisher
2013-02-23  4:14                                                                   ` Stan Hoeppner
2013-02-12  7:34                         ` Mikael Abrahamsson
2013-02-08  7:17         ` Adam Goryachev
2013-02-07 12:01     ` Brad Campbell
2013-02-07 12:37       ` Adam Goryachev
2013-02-07 17:12         ` Fredrik Lindgren
2013-02-08  0:00           ` Adam Goryachev
2013-02-11 19:49   ` Roy Sigurd Karlsbakk
2013-02-11 20:30     ` Dave Cundiff
2013-02-07 11:32 ` Mikael Abrahamsson
