* Shell Scripts or Arbitrary Priority Callouts?
@ 2009-03-18 22:57 Christopher Chen
  2009-03-19 11:04 ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: Christopher Chen @ 2009-03-18 22:57 UTC (permalink / raw)
  To: dm-devel

Hello, running Centos 5.2 here--

The multipathd daemon is very unhappy with any arbitrary script I
provide for determining priorities. I see some fuzz in the syslog
about ramfs and static binaries.

How do I use shell scripts or arbitrary programs for multipathd? I
compiled a simple program that spits out "1" and it seems to return
appropriately.

Also, why does multipath -ll return the appropriate data, namely
prio=1 (when using my custom statically compiled callout) and
multipath -l always returns prio=0? Is this an indication of a broken
configuration or something else?

Cheers

cc

-- 
Chris Chen <muffaleta@gmail.com>
"I want the kind of six pack you can't drink."
-- Micah


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-18 22:57 Shell Scripts or Arbitrary Priority Callouts? Christopher Chen
@ 2009-03-19 11:04 ` John A. Sullivan III
  2009-03-20  5:11   ` Christopher Chen
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-19 11:04 UTC (permalink / raw)
  To: device-mapper development

On Wed, 2009-03-18 at 15:57 -0700, Christopher Chen wrote:
> Hello, running Centos 5.2 here--
> 
> The multipathd daemon is very unhappy with any arbritrary script I
> provide for determining priorities. I see some fuzz in the syslog
> about ramfs and static binaries.
> 
> How do I use shell scripts or arbitrary programs for multipathd? I
> compiled a simple program that spits out "1" and it seems to return
> appropriately.
> 
> Also, why does multipath -ll return the appropriate data, namely
> prio=1 (when using my custom statically compiled callout) and
> multipath -l always returns prio=0? Is this an indication of a broken
> configuration or something else?
> 
> Cheers
> 
> cc
> 
I had the exact same problem and someone kindly explained it on this
list so thanks to them.

If I understand it correctly, multipathd must be prepared to function if
it loses access to disk.  Therefore, it is designed not to read from
disk but to cache everything it needs in memory.  Apparently, it can only
cache binaries.

To use a shell script, call it via the shell, i.e., rather than calling
shell.script directly, call sh shell.script.
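
For reference, on CentOS 5 the callout is wired into /etc/multipath.conf
roughly like this (the script path and the %n device-name substitution are
just an example from memory - check multipath.conf(5) for your version):

    defaults {
        # run the script through /bin/sh so that multipathd only needs to
        # cache the shell binary, not the script itself
        prio_callout    "/bin/sh /usr/local/sbin/evenodd-prio.sh %n"
    }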

That worked perfectly fine for me.  However, I do not know if multipathd
actually caches shell.script or if it still must read it from disk when
invoking sh and hence remains vulnerable to loss of disk access.  Does
anyone know? Thanks - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-19 11:04 ` John A. Sullivan III
@ 2009-03-20  5:11   ` Christopher Chen
  2009-03-20 10:01     ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: Christopher Chen @ 2009-03-20  5:11 UTC (permalink / raw)
  To: device-mapper development

On Thu, Mar 19, 2009 at 4:04 AM, John A. Sullivan III
<jsullivan@opensourcedevel.com> wrote:
> On Wed, 2009-03-18 at 15:57 -0700, Christopher Chen wrote:
>> Hello, running Centos 5.2 here--
>>
>> The multipathd daemon is very unhappy with any arbritrary script I
>> provide for determining priorities. I see some fuzz in the syslog
>> about ramfs and static binaries.
>>
>> How do I use shell scripts or arbitrary programs for multipathd? I
>> compiled a simple program that spits out "1" and it seems to return
>> appropriately.
>>
>> Also, why does multipath -ll return the appropriate data, namely
>> prio=1 (when using my custom statically compiled callout) and
>> multipath -l always returns prio=0? Is this an indication of a broken
>> configuration or something else?
>>
>> Cheers
>>
>> cc
>>
> I had the exact same problem and someone kindly explained it on this
> list so thanks to them.
>
> If I understand it correctly, multipathd must be prepared to function if
> it loses access to disk.  Therefore, it is designed to not read from
> disk but caches everything it needs in memory.  Apparently, it can only
> cache binaries.
>
> To use a shell script, call it via the shell, i.e., rather than
> shell.script call sh shell.script.
>
> That worked perfectly fine for me.  However, I do not know if multipathd
> actually caches shell.script or if it still must read it from disk when
> invoking sh and hence remains vulnerable to loss of disk access.  Does
> anyone know? Thanks - John

John:

Thanks for the reply.

I ended up writing a small C program to do the priority computation for me.

I have two sets of FC-AL shelves attached to two dual-channel Qlogic
cards. That gives me two paths to each disk. I have about 56 spindles
in the current configuration, and am tying them together with md
software raid.

Now, even though each disk says it handles concurrent I/O on each
port, my testing indicates that throughput suffers when using multibus
by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
when using multibus).

However, with failover, I am effectively using only one channel on
each card. With my custom priority callout, I more or less match the
disks with even numbers to the even numbered scsi channels with a
higher priority. Same with the odd numbered disks and odd numbered
channels. The odd disks are secondary on the even channels and vice versa. It seems to work
rather well, and appears to spread the load nicely.
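
Roughly the idea, sketched here as a shell callout rather than the actual C
program I wrote (the sysfs handling and the priority values are illustrative
only):

    #!/bin/sh
    # multipathd passes the device name (e.g. sdq) as the argument
    dev=$1
    # /sys/block/<dev>/device resolves to .../host2/target2:0:13/2:0:13:0
    hctl=$(basename $(readlink -f /sys/block/$dev/device))
    host=$(echo $hctl | cut -d: -f1)
    target=$(echo $hctl | cut -d: -f3)
    # prefer even disks on even channels (and odd on odd); the other
    # pairing becomes the secondary path
    if [ $(( (host + target) % 2 )) -eq 0 ]; then
        echo 2
    else
        echo 1
    fi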

Thanks again for your help!

-- 
Chris Chen <muffaleta@gmail.com>
"I want the kind of six pack you can't drink."
-- Micah


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-20  5:11   ` Christopher Chen
@ 2009-03-20 10:01     ` John A. Sullivan III
  2009-03-22 15:27       ` Pasi Kärkkäinen
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-20 10:01 UTC (permalink / raw)
  To: device-mapper development

On Thu, 2009-03-19 at 22:11 -0700, Christopher Chen wrote:
> On Thu, Mar 19, 2009 at 4:04 AM, John A. Sullivan III
> <jsullivan@opensourcedevel.com> wrote:
> > On Wed, 2009-03-18 at 15:57 -0700, Christopher Chen wrote:
> >> Hello, running Centos 5.2 here--
> >>
> >> The multipathd daemon is very unhappy with any arbritrary script I
> >> provide for determining priorities. I see some fuzz in the syslog
> >> about ramfs and static binaries.
> >>
> >> How do I use shell scripts or arbitrary programs for multipathd? I
> >> compiled a simple program that spits out "1" and it seems to return
> >> appropriately.
> >>
> >> Also, why does multipath -ll return the appropriate data, namely
> >> prio=1 (when using my custom statically compiled callout) and
> >> multipath -l always returns prio=0? Is this an indication of a broken
> >> configuration or something else?
> >>
> >> Cheers
> >>
> >> cc
> >>
> > I had the exact same problem and someone kindly explained it on this
> > list so thanks to them.
> >
> > If I understand it correctly, multipathd must be prepared to function if
> > it loses access to disk.  Therefore, it is designed to not read from
> > disk but caches everything it needs in memory.  Apparently, it can only
> > cache binaries.
> >
> > To use a shell script, call it via the shell, i.e., rather than
> > shell.script call sh shell.script.
> >
> > That worked perfectly fine for me.  However, I do not know if multipathd
> > actually caches shell.script or if it still must read it from disk when
> > invoking sh and hence remains vulnerable to loss of disk access.  Does
> > anyone know? Thanks - John
> 
> John:
> 
> Thanks for the reply.
> 
> I ended up writing a small C program to do the priority computation for me.
> 
> I have two sets of FC-AL shelves attached to two dual-channel Qlogic
> cards. That gives me two paths to each disk. I have about 56 spindles
> in the current configuration, and am tying them together with md
> software raid.
> 
> Now, even though each disk says it handles concurrent I/O on each
> port, my testing indicates that throughput suffers when using multibus
> by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
> when using multibus).
> 
> However, with failover, I am effectively using only one channel on
> each card. With my custom priority callout, I more or less match the
> disks with even numbers to the even numbered scsi channels with a
> higher priority. Same with the odd numbered disks and odd numbered
> channels. The odds are 2ndary on even and vice versa. It seems to work
> rather well, and appears to spread the load nicely.
> 
> Thanks again for your help!
> 
I'm really glad you brought up the performance problem. I had posted
about it a few days ago but it seems to have gotten lost.  We are really
struggling with performance issues when attempting to combine multiple
paths (in the case of multipath to one big target) or targets (in the
case of software RAID0 across several targets) rather than using, in
effect, JBODs.  In our case, we are using iSCSI.

Like you, we found that using multibus caused almost a linear drop in
performance.  Round-robin across two paths gave half the aggregate
throughput of two separate disks; with four paths, one fourth.

We also tried striping across the targets with software RAID0 combined
with failover multipath - roughly the same effect.

We really don't want to be forced to treat SAN-attached disks as
JBODs.  Has anyone cracked the problem of using them in either multibus
or RAID0, so we can present them as a single device to the OS and still
load balance across multiple paths?  This is a HUGE problem for us so any
help is greatly appreciated.  Thanks - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-20 10:01     ` John A. Sullivan III
@ 2009-03-22 15:27       ` Pasi Kärkkäinen
  2009-03-22 16:50         ` John A. Sullivan III
                           ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-22 15:27 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III wrote:
> > 
> > John:
> > 
> > Thanks for the reply.
> > 
> > I ended up writing a small C program to do the priority computation for me.
> > 
> > I have two sets of FC-AL shelves attached to two dual-channel Qlogic
> > cards. That gives me two paths to each disk. I have about 56 spindles
> > in the current configuration, and am tying them together with md
> > software raid.
> > 
> > Now, even though each disk says it handles concurrent I/O on each
> > port, my testing indicates that throughput suffers when using multibus
> > by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
> > when using multibus).
> > 
> > However, with failover, I am effectively using only one channel on
> > each card. With my custom priority callout, I more or less match the
> > disks with even numbers to the even numbered scsi channels with a
> > higher priority. Same with the odd numbered disks and odd numbered
> > channels. The odds are 2ndary on even and vice versa. It seems to work
> > rather well, and appears to spread the load nicely.
> > 
> > Thanks again for your help!
> > 
> I'm really glad you brought up the performance problem. I had posted
> about it a few days ago but it seems to have gotten lost.  We are really
> struggling with performance issues when attempting to combine multiple
> paths (in the case of multipath to one big target) or targets (in the
> case of software RAID0 across several targets) rather than using, in
> effect, JBODs.  In our case, we are using iSCSI.
> 
> Like you, we found that using multibus caused almost a linear drop in
> performance.  Round robin across two paths was half as much as aggregate
> throughput to two separate disks, four paths, one fourth.
> 
> We also tried striping across the targets with software RAID0 combined
> with failover multipath - roughly the same effect.
> 
> We really don't want to be forced to treated SAN attached disks as
> JDOBs.  Has anyone cracked this problem of using them in either multibus
> or RAID0 so we can present them as a single device to the OS and still
> load balance multiple paths.  This is a HUGE problem for us so any help
> is greatly appreciated.  Thanks- John

Hello.

Hmm.. just a guess, but could this be related to the fact that if your paths
to the storage are different iSCSI sessions (open-iscsi _doesn't_ support
multiple connections per session aka MC/s), then there is a separate SCSI
command queue per path.. and if SCSI requests are split across those queues
they can get out of order, and that causes the performance drop?
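
You can see the one-session-per-path layout on the initiator with something
like this (the portals and target name below are made up; each line is a
separate session, i.e. a separate SCSI command queue):

    iscsiadm -m session
    # tcp: [1] 192.168.10.1:3260,1 iqn.2009-03.example:store0
    # tcp: [2] 192.168.11.1:3260,1 iqn.2009-03.example:store0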

See:
http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html

Especially the reply from Ross (CC). Maybe he has some comments :) 

-- Pasi


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-22 15:27       ` Pasi Kärkkäinen
@ 2009-03-22 16:50         ` John A. Sullivan III
  2009-03-23  4:42         ` Christopher Chen
  2009-03-23  9:46         ` John A. Sullivan III
  2 siblings, 0 replies; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-22 16:50 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Sun, 2009-03-22 at 17:27 +0200, Pasi Kärkkäinen wrote:
> <snip>> > Now, even though each disk says it handles concurrent I/O on each
> > > port, my testing indicates that throughput suffers when using multibus
> > > by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
> > > when using multibus).
> > > 
> > > However, with failover, I am effectively using only one channel on
> > > each card. With my custom priority callout, I more or less match the
> > > disks with even numbers to the even numbered scsi channels with a
> > > higher priority. Same with the odd numbered disks and odd numbered
> > > channels. The odds are 2ndary on even and vice versa. It seems to work
> > > rather well, and appears to spread the load nicely.
> > > 
> > > Thanks again for your help!
> > > 
> > I'm really glad you brought up the performance problem. I had posted
> > about it a few days ago but it seems to have gotten lost.  We are really
> > struggling with performance issues when attempting to combine multiple
> > paths (in the case of multipath to one big target) or targets (in the
> > case of software RAID0 across several targets) rather than using, in
> > effect, JBODs.  In our case, we are using iSCSI.
> > 
> > Like you, we found that using multibus caused almost a linear drop in
> > performance.  Round robin across two paths was half as much as aggregate
> > throughput to two separate disks, four paths, one fourth.
> > 
> > We also tried striping across the targets with software RAID0 combined
> > with failover multipath - roughly the same effect.
> > 
> > We really don't want to be forced to treated SAN attached disks as
> > JDOBs.  Has anyone cracked this problem of using them in either multibus
> > or RAID0 so we can present them as a single device to the OS and still
> > load balance multiple paths.  This is a HUGE problem for us so any help
> > is greatly appreciated.  Thanks- John
> 
> Hello.
> 
> Hmm.. just a guess, but could this be related to the fact that if your paths
> to the storage are different iSCSI sessions (open-iscsi _doesn't_ support
> multiple connections per session aka MC/s), then there is a separate SCSI
> command queue per path.. and if SCSI requests are split across those queues 
> they can get out-of-order and that causes performance drop?
> 
> See:
> http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
> 
> Especially the reply from Ross (CC). Maybe he has some comments :) 
<snip>
That makes sense and would explain what we are seeing with multipath, but
why would we see the same thing with mdadm when using RAID0 to stripe
across multiple iSCSI targets? I would think that, just like increasing
physical spindles increases performance, increasing iSCSI sessions would
also increase performance.

On a side note, we did discover quite a bit about the influence of the
I/O scheduler last night.  We found that there was only a marginal
difference between cfq, deadline, anticipatory, and noop when running a
single thread.  However, when running multiple threads, cfq did not
scale at all; performance for 10 threads was the same as for one - in
our case, roughly 6900 IOPS at a 512-byte block size for sequential reads.
The other schedulers scaled almost linearly (at least at first): 10
threads shot up to 42000 IOPS, and 40 threads to over 60000 IOPS.
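
For anyone following along, the scheduler can be checked and switched per
block device at runtime - sdb below is just a placeholder:

    cat /sys/block/sdb/queue/scheduler
    # noop anticipatory deadline [cfq]
    echo deadline > /sys/block/sdb/queue/scheduler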

We did find that RAID0 was able to scale - at 100 threads we hit around
106000 IOPS on our Nexenta-based Z200 from Pogo Linux - but performance on
a single thread is still less than performance to a single "spindle", i.e.
a single session.  Why is that? Thanks - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-22 15:27       ` Pasi Kärkkäinen
  2009-03-22 16:50         ` John A. Sullivan III
@ 2009-03-23  4:42         ` Christopher Chen
  2009-03-23  9:46         ` John A. Sullivan III
  2 siblings, 0 replies; 34+ messages in thread
From: Christopher Chen @ 2009-03-23  4:42 UTC (permalink / raw)
  Cc: device-mapper development, Ross S. W. Walker

On Mar 22, 2009, at 8:27, Pasi Kärkkäinen <pasik@iki.fi> wrote:

> On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III wrote:
>>>
>>> John:
>>>
>>> Thanks for the reply.
>>>
>>> I ended up writing a small C program to do the priority  
>>> computation for me.
>>>
>>> I have two sets of FC-AL shelves attached to two dual-channel Qlogic
>>> cards. That gives me two paths to each disk. I have about 56  
>>> spindles
>>> in the current configuration, and am tying them together with md
>>> software raid.
>>>
>>> Now, even though each disk says it handles concurrent I/O on each
>>> port, my testing indicates that throughput suffers when using  
>>> multibus
>>> by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/ 
>>> sec
>>> when using multibus).
>>>
>>> However, with failover, I am effectively using only one channel on
>>> each card. With my custom priority callout, I more or less match the
>>> disks with even numbers to the even numbered scsi channels with a
>>> higher priority. Same with the odd numbered disks and odd numbered
>>> channels. The odds are 2ndary on even and vice versa. It seems to  
>>> work
>>> rather well, and appears to spread the load nicely.
>>>
>>> Thanks again for your help!
>>>
>> I'm really glad you brought up the performance problem. I had posted
>> about it a few days ago but it seems to have gotten lost.  We are  
>> really
>> struggling with performance issues when attempting to combine  
>> multiple
>> paths (in the case of multipath to one big target) or targets (in the
>> case of software RAID0 across several targets) rather than using, in
>> effect, JBODs.  In our case, we are using iSCSI.
>>
>> Like you, we found that using multibus caused almost a linear drop in
>> performance.  Round robin across two paths was half as much as  
>> aggregate
>> throughput to two separate disks, four paths, one fourth.
>>
>> We also tried striping across the targets with software RAID0  
>> combined
>> with failover multipath - roughly the same effect.
>>
>> We really don't want to be forced to treated SAN attached disks as
>> JDOBs.  Has anyone cracked this problem of using them in either  
>> multibus
>> or RAID0 so we can present them as a single device to the OS and  
>> still
>> load balance multiple paths.  This is a HUGE problem for us so any  
>> help
>> is greatly appreciated.  Thanks- John
>
> Hello.
>
> Hmm.. just a guess, but could this be related to the fact that if  
> your paths
> to the storage are different iSCSI sessions (open-iscsi _doesn't_  
> support
> multiple connections per session aka MC/s), then there is a separate  
> SCSI
> command queue per path.. and if SCSI requests are split across those  
> queues
> they can get out-of-order and that causes performance drop?
>
> See:
> http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
>
> Especially the reply from Ross (CC). Maybe he has some comments :)
>
> -- Pasi
>

My problem with dm-multipath multibus running at half the speed of
failover is not with iSCSI but with some fibre channel disk shelves
I'm treating as a JBOD. I have two loops, and each FC drive is capable
of doing concurrent I/O through both ports.

I am exporting them via iSCSI, but the performance dropoff I'm seeing
is on the local host with the HBAs.

cc


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-22 15:27       ` Pasi Kärkkäinen
  2009-03-22 16:50         ` John A. Sullivan III
  2009-03-23  4:42         ` Christopher Chen
@ 2009-03-23  9:46         ` John A. Sullivan III
       [not found]           ` <CF307021-DE23-4BB1-BC6D-F4F520464208@medallion.com>
  2009-03-24  7:39           ` Pasi Kärkkäinen
  2 siblings, 2 replies; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-23  9:46 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Sun, 2009-03-22 at 17:27 +0200, Pasi Kärkkäinen wrote:
> On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III wrote:
> > > 
> > > John:
> > > 
> > > Thanks for the reply.
> > > 
> > > I ended up writing a small C program to do the priority computation for me.
> > > 
> > > I have two sets of FC-AL shelves attached to two dual-channel Qlogic
> > > cards. That gives me two paths to each disk. I have about 56 spindles
> > > in the current configuration, and am tying them together with md
> > > software raid.
> > > 
> > > Now, even though each disk says it handles concurrent I/O on each
> > > port, my testing indicates that throughput suffers when using multibus
> > > by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
> > > when using multibus).
> > > 
> > > However, with failover, I am effectively using only one channel on
> > > each card. With my custom priority callout, I more or less match the
> > > disks with even numbers to the even numbered scsi channels with a
> > > higher priority. Same with the odd numbered disks and odd numbered
> > > channels. The odds are 2ndary on even and vice versa. It seems to work
> > > rather well, and appears to spread the load nicely.
> > > 
> > > Thanks again for your help!
> > > 
> > I'm really glad you brought up the performance problem. I had posted
> > about it a few days ago but it seems to have gotten lost.  We are really
> > struggling with performance issues when attempting to combine multiple
> > paths (in the case of multipath to one big target) or targets (in the
> > case of software RAID0 across several targets) rather than using, in
> > effect, JBODs.  In our case, we are using iSCSI.
> > 
> > Like you, we found that using multibus caused almost a linear drop in
> > performance.  Round robin across two paths was half as much as aggregate
> > throughput to two separate disks, four paths, one fourth.
> > 
> > We also tried striping across the targets with software RAID0 combined
> > with failover multipath - roughly the same effect.
> > 
> > We really don't want to be forced to treated SAN attached disks as
> > JDOBs.  Has anyone cracked this problem of using them in either multibus
> > or RAID0 so we can present them as a single device to the OS and still
> > load balance multiple paths.  This is a HUGE problem for us so any help
> > is greatly appreciated.  Thanks- John
> 
> Hello.
> 
> Hmm.. just a guess, but could this be related to the fact that if your paths
> to the storage are different iSCSI sessions (open-iscsi _doesn't_ support
> multiple connections per session aka MC/s), then there is a separate SCSI
> command queue per path.. and if SCSI requests are split across those queues 
> they can get out-of-order and that causes performance drop?
> 
> See:
> http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
> 
> Especially the reply from Ross (CC). Maybe he has some comments :) 
> 
> -- Pasi
<snip>
I'm trying to spend a little time on this today and am really feeling my
ignorance on the way iSCSI works :(  It looks like linux-iscsi supports
MC/S but has not been in active development and will not even compile on
my 2.6.27 kernel.

To simplify matters, I did put each SAN interface on a separate network.
Thus, all the different sessions.  If I place them all on the same
network and use the iface parameters of open-iscsi, does that eliminate
the out-of-order problem and allow me to achieve the performance
scalability I'm seeking from dm-multipath in multibus mode? Thanks -
John
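
P.S. The iface binding I'm referring to looks roughly like this (the iface
name, portal address and eth device are made up for illustration):

    iscsiadm -m iface -I iface0 --op=new
    iscsiadm -m iface -I iface0 --op=update -n iface.net_ifacename -v eth2
    iscsiadm -m discovery -t sendtargets -p 192.168.10.1 -I iface0
    iscsiadm -m node --login
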
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


* Re: Shell Scripts or Arbitrary Priority Callouts?
       [not found]           ` <CF307021-DE23-4BB1-BC6D-F4F520464208@medallion.com>
@ 2009-03-23 13:07             ` John A. Sullivan III
  0 siblings, 0 replies; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-23 13:07 UTC (permalink / raw)
  To: Ross S. W. Walker; +Cc: device-mapper development

On Mon, 2009-03-23 at 08:50 -0400, Ross S. W. Walker wrote:
> On Mar 23, 2009, at 5:46 AM, "John A. Sullivan III"
> <jsullivan@opensourcedevel.com> wrote:
> 
> 
> 
> 
> > On Sun, 2009-03-22 at 17:27 +0200, Pasi Kärkkäinen wrote:
> > > On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III
> > wrote:
> > > > >
> > > > > John:
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > I ended up writing a small C program to do the priority
> > computation for me.
> > > > >
> > > > > I have two sets of FC-AL shelves attached to two dual-channel
> > Qlogic
> > > > > cards. That gives me two paths to each disk. I have about 56
> > spindles
> > > > > in the current configuration, and am tying them together with
> > md
> > > > > software raid.
> > > > >
> > > > > Now, even though each disk says it handles concurrent I/O on
> > each
> > > > > port, my testing indicates that throughput suffers when using
> > multibus
> > > > > by about 1/2 (from ~60 MB/sec sustained I/O with failover to
> > 35 MB/sec
> > > > > when using multibus).
> > > > >
> > > > > However, with failover, I am effectively using only one
> > channel on
> > > > > each card. With my custom priority callout, I more or less
> > match the
> > > > > disks with even numbers to the even numbered scsi channels
> > with a
> > > > > higher priority. Same with the odd numbered disks and odd
> > numbered
> > > > > channels. The odds are 2ndary on even and vice versa. It seems
> > to work
> > > > > rather well, and appears to spread the load nicely.
> > > > >
> > > > > Thanks again for your help!
> > > > >
> > > > I'm really glad you brought up the performance problem. I had
> > posted
> > > > about it a few days ago but it seems to have gotten lost.  We
> > are really
> > > > struggling with performance issues when attempting to combine
> > multiple
> > > > paths (in the case of multipath to one big target) or targets
> > (in the
> > > > case of software RAID0 across several targets) rather than
> > using, in
> > > > effect, JBODs.  In our case, we are using iSCSI.
> > > >
> > > > Like you, we found that using multibus caused almost a linear
> > drop in
> > > > performance.  Round robin across two paths was half as much as
> > aggregate
> > > > throughput to two separate disks, four paths, one fourth.
> > > >
> > > > We also tried striping across the targets with software RAID0
> > combined
> > > > with failover multipath - roughly the same effect.
> > > >
> > > > We really don't want to be forced to treated SAN attached disks
> > as
> > > > JDOBs.  Has anyone cracked this problem of using them in either
> > multibus
> > > > or RAID0 so we can present them as a single device to the OS and
> > still
> > > > load balance multiple paths.  This is a HUGE problem for us so
> > any help
> > > > is greatly appreciated.  Thanks- John
> > >
> > > Hello.
> > >
> > > Hmm.. just a guess, but could this be related to the fact that if
> > your paths
> > > to the storage are different iSCSI sessions (open-iscsi _doesn't_
> > support
> > > multiple connections per session aka MC/s), then there is a
> > separate SCSI
> > > command queue per path.. and if SCSI requests are split across
> > those queues
> > > they can get out-of-order and that causes performance drop?
> > >
> > > See:
> > >
> > http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
> > >
> > > Especially the reply from Ross (CC). Maybe he has some comments :)
> > >
> > > -- Pasi
> > <snip>
> > I'm trying to spend a little time on this today and am really
> > feeling my
> > ignorance on the way iSCSI works :(  It looks like linux-iscsi
> > supports
> > MC/S but has not been in active development and will not even
> > compile on
> > my 2.6.27 kernel.
> > 
> > To simplify matters, I did put each SAN interface on a separate
> > network.
> > Thus, all the different sessions.  If I place them all on the same
> > network and use the iface parameters of open-iscsi, does that
> > eliminate
> > the out-of-order problem and allow me to achieve the performance
> > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > John
> > 
> > 
> > 
> No, the only way to eliminate the out-of-order problem is MC/s. You
> can mask the issue when using IET by using fileio, which caches the
> requests in page cache and coalesces them before they actually go to
> disk.
> 
> The issue here seems like it might be dm-multipath, though.
> 
> If your workload is random, though, which most is, then sequential
> performance is inconsequential.
> 
> -Ross
<snip>
Thanks very much, Ross.  As I was working through the iface options of
open-iscsi this morning, I began to realize, just as you point out, they
are still separate sessions.  We are not using Linux on the target end
but rather ZFS on OpenSolaris via Nexenta.  I'll have to find out from
them what the equivalent to fileio is.  Thanks again - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-23  9:46         ` John A. Sullivan III
       [not found]           ` <CF307021-DE23-4BB1-BC6D-F4F520464208@medallion.com>
@ 2009-03-24  7:39           ` Pasi Kärkkäinen
  2009-03-24 11:02             ` John A. Sullivan III
  1 sibling, 1 reply; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-24  7:39 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Mon, Mar 23, 2009 at 05:46:36AM -0400, John A. Sullivan III wrote:
> On Sun, 2009-03-22 at 17:27 +0200, Pasi Kärkkäinen wrote:
> > On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III wrote:
> > > > 
> > > > John:
> > > > 
> > > > Thanks for the reply.
> > > > 
> > > > I ended up writing a small C program to do the priority computation for me.
> > > > 
> > > > I have two sets of FC-AL shelves attached to two dual-channel Qlogic
> > > > cards. That gives me two paths to each disk. I have about 56 spindles
> > > > in the current configuration, and am tying them together with md
> > > > software raid.
> > > > 
> > > > Now, even though each disk says it handles concurrent I/O on each
> > > > port, my testing indicates that throughput suffers when using multibus
> > > > by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
> > > > when using multibus).
> > > > 
> > > > However, with failover, I am effectively using only one channel on
> > > > each card. With my custom priority callout, I more or less match the
> > > > disks with even numbers to the even numbered scsi channels with a
> > > > higher priority. Same with the odd numbered disks and odd numbered
> > > > channels. The odds are 2ndary on even and vice versa. It seems to work
> > > > rather well, and appears to spread the load nicely.
> > > > 
> > > > Thanks again for your help!
> > > > 
> > > I'm really glad you brought up the performance problem. I had posted
> > > about it a few days ago but it seems to have gotten lost.  We are really
> > > struggling with performance issues when attempting to combine multiple
> > > paths (in the case of multipath to one big target) or targets (in the
> > > case of software RAID0 across several targets) rather than using, in
> > > effect, JBODs.  In our case, we are using iSCSI.
> > > 
> > > Like you, we found that using multibus caused almost a linear drop in
> > > performance.  Round robin across two paths was half as much as aggregate
> > > throughput to two separate disks, four paths, one fourth.
> > > 
> > > We also tried striping across the targets with software RAID0 combined
> > > with failover multipath - roughly the same effect.
> > > 
> > > We really don't want to be forced to treated SAN attached disks as
> > > JDOBs.  Has anyone cracked this problem of using them in either multibus
> > > or RAID0 so we can present them as a single device to the OS and still
> > > load balance multiple paths.  This is a HUGE problem for us so any help
> > > is greatly appreciated.  Thanks- John
> > 
> > Hello.
> > 
> > Hmm.. just a guess, but could this be related to the fact that if your paths
> > to the storage are different iSCSI sessions (open-iscsi _doesn't_ support
> > multiple connections per session aka MC/s), then there is a separate SCSI
> > command queue per path.. and if SCSI requests are split across those queues 
> > they can get out-of-order and that causes performance drop?
> > 
> > See:
> > http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
> > 
> > Especially the reply from Ross (CC). Maybe he has some comments :) 
> > 
> > -- Pasi
> <snip>
> I'm trying to spend a little time on this today and am really feeling my
> ignorance on the way iSCSI works :(  It looks like linux-iscsi supports
> MC/S but has not been in active development and will not even compile on
> my 2.6.27 kernel.
> 
> To simplify matters, I did put each SAN interface on a separate network.
> Thus, all the different sessions.  If I place them all on the same
> network and use the iface parameters of open-iscsi, does that eliminate
> the out-of-order problem and allow me to achieve the performance
> scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> John

If you use the ifaces feature of open-iscsi, you still get separate sessions.

open-iscsi just does not support MC/s :(

I think core-iscsi does support MC/s.. 

Then again, you should play with the different multipath settings and
tweak how often I/Os are split across the different paths, etc. Maybe that helps.
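
For example, something along these lines in /etc/multipath.conf - the
values are only a starting point to experiment with:

    defaults {
        path_grouping_policy    multibus
        rr_min_io               10    # I/Os sent down a path before switching
    }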

-- Pasi


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24  7:39           ` Pasi Kärkkäinen
@ 2009-03-24 11:02             ` John A. Sullivan III
  2009-03-24 11:57               ` Pasi Kärkkäinen
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-24 11:02 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, 2009-03-24 at 09:39 +0200, Pasi Kärkkäinen wrote:
> On Mon, Mar 23, 2009 at 05:46:36AM -0400, John A. Sullivan III wrote:
> > On Sun, 2009-03-22 at 17:27 +0200, Pasi Kärkkäinen wrote:
> > > On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III wrote:
> > > > > 
> > > > > John:
> > > > > 
> > > > > Thanks for the reply.
> > > > > 
> > > > > I ended up writing a small C program to do the priority computation for me.
> > > > > 
> > > > > I have two sets of FC-AL shelves attached to two dual-channel Qlogic
> > > > > cards. That gives me two paths to each disk. I have about 56 spindles
> > > > > in the current configuration, and am tying them together with md
> > > > > software raid.
> > > > > 
> > > > > Now, even though each disk says it handles concurrent I/O on each
> > > > > port, my testing indicates that throughput suffers when using multibus
> > > > > by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
> > > > > when using multibus).
> > > > > 
> > > > > However, with failover, I am effectively using only one channel on
> > > > > each card. With my custom priority callout, I more or less match the
> > > > > disks with even numbers to the even numbered scsi channels with a
> > > > > higher priority. Same with the odd numbered disks and odd numbered
> > > > > channels. The odds are 2ndary on even and vice versa. It seems to work
> > > > > rather well, and appears to spread the load nicely.
> > > > > 
> > > > > Thanks again for your help!
> > > > > 
> > > > I'm really glad you brought up the performance problem. I had posted
> > > > about it a few days ago but it seems to have gotten lost.  We are really
> > > > struggling with performance issues when attempting to combine multiple
> > > > paths (in the case of multipath to one big target) or targets (in the
> > > > case of software RAID0 across several targets) rather than using, in
> > > > effect, JBODs.  In our case, we are using iSCSI.
> > > > 
> > > > Like you, we found that using multibus caused almost a linear drop in
> > > > performance.  Round robin across two paths was half as much as aggregate
> > > > throughput to two separate disks, four paths, one fourth.
> > > > 
> > > > We also tried striping across the targets with software RAID0 combined
> > > > with failover multipath - roughly the same effect.
> > > > 
> > > > We really don't want to be forced to treated SAN attached disks as
> > > > JDOBs.  Has anyone cracked this problem of using them in either multibus
> > > > or RAID0 so we can present them as a single device to the OS and still
> > > > load balance multiple paths.  This is a HUGE problem for us so any help
> > > > is greatly appreciated.  Thanks- John
> > > 
> > > Hello.
> > > 
> > > Hmm.. just a guess, but could this be related to the fact that if your paths
> > > to the storage are different iSCSI sessions (open-iscsi _doesn't_ support
> > > multiple connections per session aka MC/s), then there is a separate SCSI
> > > command queue per path.. and if SCSI requests are split across those queues 
> > > they can get out-of-order and that causes performance drop?
> > > 
> > > See:
> > > http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
> > > 
> > > Especially the reply from Ross (CC). Maybe he has some comments :) 
> > > 
> > > -- Pasi
> > <snip>
> > I'm trying to spend a little time on this today and am really feeling my
> > ignorance on the way iSCSI works :(  It looks like linux-iscsi supports
> > MC/S but has not been in active development and will not even compile on
> > my 2.6.27 kernel.
> > 
> > To simplify matters, I did put each SAN interface on a separate network.
> > Thus, all the different sessions.  If I place them all on the same
> > network and use the iface parameters of open-iscsi, does that eliminate
> > the out-of-order problem and allow me to achieve the performance
> > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > John
> 
> If you use ifaces feature of open-iscsi, you still get separate sessions.
> 
> open-iscsi just does not support MC/s :(
> 
> I think core-iscsi does support MC/s.. 
> 
> Then you again you should play with the different multipath settings, and
> tweak how often IOs are split to different paths etc.. maybe that helps.
> 
> -- Pasi
<snip>
I think we're pretty much at the end of our options here, but I'll document
what I've found thus far for closure.

Indeed, there seems to be no way around the session problem.  Core-iscsi
does seem to support MC/s but has not been updated in years.  It did not
compile with my 2.6.27 kernel and, given that others seem to have had
the same problem, I did not spend a lot of time troubleshooting it.

We did play with the multipath rr_min_io settings, and smaller always
seemed to be better until we got into very large numbers of sessions.  We
were testing on a dual quad-core AMD Shanghai 2378 system with 32 GB
RAM, a quad-port Intel e1000 card and two on-board nvidia forcedeth
ports, running disktest with 4K blocks (to mimic the file system) doing
sequential reads (and some sequential writes).

With a single thread, there was no difference at all - only about 12.79
MB/s no matter what we did.  With 10 threads and only two interfaces,
there was only a slight difference between rr=1 (81.2 MB/s), rr=10 (78.87)
and rr=100 (80).

However, when we opened to three and four interfaces, there was a huge
jump for rr=1 (100.4, 105.95) versus rr=10 (80.5, 80.75) and rr=100
(74.3, 77.6).

At 100 threads on three or four ports, the best performance shifted to
rr=10 (327 MB/s, 335) rather than rr=1 (291.7, 290.1) or rr=100 (216.3).
At 400 threads, rr=100 started to overtake rr=10 slightly.

This was using all e1000 interfaces. Our first four port test included
one of the on board ports and performance was dramatically less than
three e1000 ports.  Subsequent testing tweaking forcedeth parameters
from defaults yielded no improvement.

After solving the I/O scheduler problem, dm RAID0 behaved better.  It
still did not give us anywhere near a fourfold increase (four disks on
four separate ports) but only a marginal improvement (14.3 MB/s) using an
8 KB chunk size (to fit into a jumbo packet, match the zvol block size on
the back end, and be two block sizes).  It did, however, give the best
balance of performance, being just slightly slower than rr=1 at 10 threads
and slightly slower than rr=10 at 100 threads, though not scaling as well
to 400 threads.
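
For reference, a four-disk, 8 KB-chunk stripe like the one described here
can be built with, for example, mdadm (the device names below are
illustrative, not the ones we used):

    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=8 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde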

Thus, collective throughput is acceptable but individual throughput is
still awful.

Thanks, all - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 11:02             ` John A. Sullivan III
@ 2009-03-24 11:57               ` Pasi Kärkkäinen
  2009-03-24 12:21                 ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-24 11:57 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, Mar 24, 2009 at 07:02:41AM -0400, John A. Sullivan III wrote:
> > > <snip>
> > > I'm trying to spend a little time on this today and am really feeling my
> > > ignorance on the way iSCSI works :(  It looks like linux-iscsi supports
> > > MC/S but has not been in active development and will not even compile on
> > > my 2.6.27 kernel.
> > > 
> > > To simplify matters, I did put each SAN interface on a separate network.
> > > Thus, all the different sessions.  If I place them all on the same
> > > network and use the iface parameters of open-iscsi, does that eliminate
> > > the out-of-order problem and allow me to achieve the performance
> > > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > > John
> > 
> > If you use ifaces feature of open-iscsi, you still get separate sessions.
> > 
> > open-iscsi just does not support MC/s :(
> > 
> > I think core-iscsi does support MC/s.. 
> > 
> > Then you again you should play with the different multipath settings, and
> > tweak how often IOs are split to different paths etc.. maybe that helps.
> > 
> > -- Pasi
> <snip>
> I think we're pretty much at the end of our options here but I document
> what I've found thus far for closure.
> 
> Indeed, there seems to be no way around the session problem.  Core-iscsi
> does seem to support MC/s but has not been updated in years.  It did not
> compile with my 2.6.27 kernel and, given that others seem to have had
> the same problem, I did not spend a lot of time troubleshooting it.
> 

The core-iscsi developer seems to be actively developing at least the
new iSCSI target (LIO target).. I think he has been testing it with
core-iscsi, so maybe there's a newer version somewhere?

> We did play with the multipath rr_min_io settings and smaller always
> seemed to be better until we got into very large numbers of session.  We
> were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> ports with disktest using 4K blocks to mimic the file system using
> sequential reads (and some sequential writes).
> 

Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
traffic? 

> With a single thread, there was no difference at all - only about 12.79
> MB/s no matter what we did.  With 10 threads and only two interfaces,
> there was only a slight difference between rr=1 (81.2B/s), rr=10 (78.87)
> and rr=100 (80).
> 
> However, when we opened to three and four interfaces, there was a huge
> jump for rr=1 (100.4, 105.95) versus rr=10 (80.5, 80.75) and rr=100
> (74.3, 77.6).
> 
> At 100 threads on three or four ports, the best performance shifted to
> rr=10 (327 MB/s, 335) rather than rr=1 (291.7, 290.1) or rr=100 (216.3).
> At 400 threads, rr=100 started to overtake rr=10 slightly.
> 
> This was using all e1000 interfaces. Our first four port test included
> one of the on board ports and performance was dramatically less than
> three e1000 ports.  Subsequent testing tweaking forcedeth parameters
> from defaults yielded no improvement.
> 
> After solving the I/O scheduler problem, dm RAID0 behaved better.  It
> still did not give us anywhere near a fourfold increase (four disks on
> four separate ports) but only marginal improvement (14.3 MB/s) using c=8
> (to fit into a jumbo packet, match the zvol block size on the back end
> and be two block sizes).  It did, however, give the best balance of
> performance being just slightly slower than rr=1 at 10 threads and
> slightly slower than rr=10 at 100 threads though not scaling as well to
> 400 threads.
> 

When you used dm RAID0 you didn't have any multipath configuration, right? 

What kind of stripe size and other settings you had for RAID0?

What kind of performance do you get using just a single iSCSI session (and
thus just a single path), no multipathing, no DM RAID0? Just a filesystem
directly on top of the iSCSI /dev/sd? device.
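
A quick way to sanity-check that, bypassing the page cache (the device name
is a placeholder):

    dd if=/dev/sdb of=/dev/null bs=4k count=100000 iflag=direct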

> Thus, collective throughput is acceptable but individual throughput is
> still awful.
> 

Sounds like there's some other problem if individual throughput is bad? Or did
you mean performance with a single disktest I/O thread is bad, but with multiple
disktest threads it's good.. that would make more sense :) 

-- Pasi


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 11:57               ` Pasi Kärkkäinen
@ 2009-03-24 12:21                 ` John A. Sullivan III
  2009-03-24 15:01                   ` Pasi Kärkkäinen
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-24 12:21 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, 2009-03-24 at 13:57 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 07:02:41AM -0400, John A. Sullivan III wrote:
> > > > <snip>
> > > > I'm trying to spend a little time on this today and am really feeling my
> > > > ignorance on the way iSCSI works :(  It looks like linux-iscsi supports
> > > > MC/S but has not been in active development and will not even compile on
> > > > my 2.6.27 kernel.
> > > > 
> > > > To simplify matters, I did put each SAN interface on a separate network.
> > > > Thus, all the different sessions.  If I place them all on the same
> > > > network and use the iface parameters of open-iscsi, does that eliminate
> > > > the out-of-order problem and allow me to achieve the performance
> > > > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > > > John
> > > 
> > > If you use ifaces feature of open-iscsi, you still get separate sessions.
> > > 
> > > open-iscsi just does not support MC/s :(
> > > 
> > > I think core-iscsi does support MC/s.. 
> > > 
> > > Then you again you should play with the different multipath settings, and
> > > tweak how often IOs are split to different paths etc.. maybe that helps.
> > > 
> > > -- Pasi
> > <snip>
> > I think we're pretty much at the end of our options here but I document
> > what I've found thus far for closure.
> > 
> > Indeed, there seems to be no way around the session problem.  Core-iscsi
> > does seem to support MC/s but has not been updated in years.  It did not
> > compile with my 2.6.27 kernel and, given that others seem to have had
> > the same problem, I did not spend a lot of time troubleshooting it.
> > 
> 
> Core-iscsi developer seems to be active developing at least the 
> new iSCSI target (LIO target).. I think he has been testing it with
> core-iscsi, so maybe there's newer version somewhere? 
> 
> > We did play with the multipath rr_min_io settings and smaller always
> > seemed to be better until we got into very large numbers of session.  We
> > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > ports with disktest using 4K blocks to mimic the file system using
> > sequential reads (and some sequential writes).
> > 
> 
> Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> traffic? 
> 
> > With a single thread, there was no difference at all - only about 12.79
> > MB/s no matter what we did.  With 10 threads and only two interfaces,
> > there was only a slight difference between rr=1 (81.2B/s), rr=10 (78.87)
> > and rr=100 (80).
> > 
> > However, when we opened to three and four interfaces, there was a huge
> > jump for rr=1 (100.4, 105.95) versus rr=10 (80.5, 80.75) and rr=100
> > (74.3, 77.6).
> > 
> > At 100 threads on three or four ports, the best performance shifted to
> > rr=10 (327 MB/s, 335) rather than rr=1 (291.7, 290.1) or rr=100 (216.3).
> > At 400 threads, rr=100 started to overtake rr=10 slightly.
> > 
> > This was using all e1000 interfaces. Our first four port test included
> > one of the on board ports and performance was dramatically less than
> > three e1000 ports.  Subsequent testing tweaking forcedeth parameters
> > from defaults yielded no improvement.
> > 
> > After solving the I/O scheduler problem, dm RAID0 behaved better.  It
> > still did not give us anywhere near a fourfold increase (four disks on
> > four separate ports) but only marginal improvement (14.3 MB/s) using c=8
> > (to fit into a jumbo packet, match the zvol block size on the back end
> > and be two block sizes).  It did, however, give the best balance of
> > performance being just slightly slower than rr=1 at 10 threads and
> > slightly slower than rr=10 at 100 threads though not scaling as well to
> > 400 threads.
> > 
> 
> When you used dm RAID0 you didn't have any multipath configuration, right? 
Correct although we also did test successfully with multipath in
failover mode and RAID0.
> 
> What kind of stripe size and other settings you had for RAID0?
Chunk size was 8KB with four disks.  
> 
> What kind of performance do you get using just a single iscsi session (and
> thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> directly on top of the iscsi /dev/sd? device.
Miserable - same roughly 12 MB/s.
> 
> > Thus, collective throughput is acceptable but individual throughput is
> > still awful.
> > 
> 
> Sounds like there's some other problem if invidual throughput is bad? Or did
> you mean performance with a single disktest IO thread is bad, but using multiple
> disktest threads it's good.. that would make more sense :) 
Yes, the latter.  Single thread (I assume mimicking a single disk
operation, e.g., copying a large file) is miserable - much slower than
local disk despite the availability of huge bandwidth.  We start
utilizing the bandwidth when multiplying concurrent disk activity into
the hundreds.

I am guessing the single thread performance problem is an open-iscsi
issue but I was hoping multipath would help us work around it by
utilizing multiple sessions per disk operation.  I suppose that is where
we run into the command ordering problem unless there is something else
afoot.  Thanks - John
<snip>
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 12:21                 ` John A. Sullivan III
@ 2009-03-24 15:01                   ` Pasi Kärkkäinen
  2009-03-24 15:09                     ` Pasi Kärkkäinen
  2009-03-24 15:43                     ` John A. Sullivan III
  0 siblings, 2 replies; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-24 15:01 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > 
> > Core-iscsi developer seems to be active developing at least the 
> > new iSCSI target (LIO target).. I think he has been testing it with
> > core-iscsi, so maybe there's newer version somewhere? 
> > 
> > > We did play with the multipath rr_min_io settings and smaller always
> > > seemed to be better until we got into very large numbers of session.  We
> > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > ports with disktest using 4K blocks to mimic the file system using
> > > sequential reads (and some sequential writes).
> > > 
> > 
> > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > traffic? 
> > 

Dunno if you noticed this.. :) 


> > > 
> > 
> > When you used dm RAID0 you didn't have any multipath configuration, right? 
> Correct although we also did test successfully with multipath in
> failover mode and RAID0.
> > 

OK.

> > What kind of stripe size and other settings you had for RAID0?
> Chunk size was 8KB with four disks.  
> > 

Did you try with much bigger sizes, e.g. 128 kB?

> > What kind of performance do you get using just a single iscsi session (and
> > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > directly on top of the iscsi /dev/sd? device.
> Miserable - same roughly 12 MB/s.

OK, here's your problem. Was this btw for reads or writes? Did you tune the
readahead settings?

Can you paste the iSCSI session settings negotiated with the target?
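
iscsiadm can dump the negotiated parameters per session, and readahead is
easy to check at the same time (sdb is a placeholder):

    iscsiadm -m session -P 3        # shows negotiated MaxRecvDataSegmentLength etc.
    blockdev --getra /dev/sdb       # current readahead, in 512-byte sectors
    blockdev --setra 1024 /dev/sdb  # e.g. bump it to 512 kB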

> > 
> > Sounds like there's some other problem if invidual throughput is bad? Or did
> > you mean performance with a single disktest IO thread is bad, but using multiple
> > disktest threads it's good.. that would make more sense :) 
> Yes, the latter.  Single thread (I assume mimicking a single disk
> operation, e.g., copying a large file) is miserable - much slower than
> local disk despite the availability of huge bandwidth.  We start
> utilizing the bandwidth when multiplying concurrent disk activity into
> the hundreds.
> 
> I am guessing the single thread performance problem is an open-iscsi
> issue but I was hoping multipath would help us work around it by
> utilizing multiple sessions per disk operation.  I suppose that is where
> we run into the command ordering problem unless there is something else
> afoot.  Thanks - John

You should be able to get many times the throughput you get now.. just with
a single path/session.

What kind of latency do you have from the initiator to the target/storage? 

Try with for example 4 kB ping:
ping -s 4096 <ip_of_the_iscsi_target>

1000ms divided by the roundtrip you get from ping should give you maximum
possible IOPS using a single path.. 

4 kB * IOPS == max bandwidth you can achieve.
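
As a rough illustration (the 0.4 ms round trip below is only an assumed
example value, not a measurement), the arithmetic looks like this:

awk 'BEGIN { rtt=0.4; iops=1000/rtt; print iops, "IOPS,", iops*4/1024, "MB/s" }'

which prints roughly 2500 IOPS and about 9.8 MB/s for 4 kB IOs.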

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 15:01                   ` Pasi Kärkkäinen
@ 2009-03-24 15:09                     ` Pasi Kärkkäinen
  2009-03-24 15:43                     ` John A. Sullivan III
  1 sibling, 0 replies; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-24 15:09 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, Mar 24, 2009 at 05:01:04PM +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > 
> > > Core-iscsi developer seems to be active developing at least the 
> > > new iSCSI target (LIO target).. I think he has been testing it with
> > > core-iscsi, so maybe there's newer version somewhere? 
> > > 
> > > > We did play with the multipath rr_min_io settings and smaller always
> > > > seemed to be better until we got into very large numbers of session.  We
> > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > ports with disktest using 4K blocks to mimic the file system using
> > > > sequential reads (and some sequential writes).
> > > > 
> > > 
> > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > traffic? 
> > > 
> 
> Dunno if you noticed this.. :) 
> 
> 
> > > > 
> > > 
> > > When you used dm RAID0 you didn't have any multipath configuration, right? 
> > Correct although we also did test successfully with multipath in
> > failover mode and RAID0.
> > > 
> 
> OK.
> 
> > > What kind of stripe size and other settings you had for RAID0?
> > Chunk size was 8KB with four disks.  
> > > 
> 
> Did you try with much bigger sizes.. 128 kB ?
> 
> > > What kind of performance do you get using just a single iscsi session (and
> > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > directly on top of the iscsi /dev/sd? device.
> > Miserable - same roughly 12 MB/s.
> 
> OK, Here's your problem. Was this btw reads or writes? Did you tune
> readahead-settings? 
> 
> Can paste your iSCSI session settings negotiated with the target? 
> 
> > > 
> > > Sounds like there's some other problem if invidual throughput is bad? Or did
> > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > disktest threads it's good.. that would make more sense :) 
> > Yes, the latter.  Single thread (I assume mimicking a single disk
> > operation, e.g., copying a large file) is miserable - much slower than
> > local disk despite the availability of huge bandwidth.  We start
> > utilizing the bandwidth when multiplying concurrent disk activity into
> > the hundreds.
> > 
> > I am guessing the single thread performance problem is an open-iscsi
> > issue but I was hoping multipath would help us work around it by
> > utilizing multiple sessions per disk operation.  I suppose that is where
> > we run into the command ordering problem unless there is something else
> > afoot.  Thanks - John
> 
> You should be able to get many times the throughput you get now.. just with
> a single path/session.
> 
> What kind of latency do you have from the initiator to the target/storage? 
> 
> Try with for example 4 kB ping:
> ping -s 4096 <ip_of_the_iscsi_target>
> 
> 1000ms divided by the roundtrip you get from ping should give you maximum
> possible IOPS using a single path.. 
> 
> 4 kB * IOPS == max bandwidth you can achieve.
> 

Maybe I should have been clearer about that.. assuming you're measuring
4 kB IOs with disktest and you have 1 outstanding IO at a time, the above
is the maximum throughput you can get.

A higher block/IO size and a higher number of outstanding IOs will give you
better throughput. 

I think CFQ disk elevator/scheduler has a bug that prevents queue depths
bigger than 1 outstanding IO.. so don't use that. "noop" might be a good idea.
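
In case it helps, the elevator can be checked and switched per device at
runtime (sdX below is just a placeholder for the iSCSI disk):

cat /sys/block/sdX/queue/scheduler
echo noop > /sys/block/sdX/queue/scheduler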

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 15:01                   ` Pasi Kärkkäinen
  2009-03-24 15:09                     ` Pasi Kärkkäinen
@ 2009-03-24 15:43                     ` John A. Sullivan III
  2009-03-24 16:36                       ` Pasi Kärkkäinen
  1 sibling, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-24 15:43 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

I greatly appreciate the help.  I'll answer in the thread below as well
as consolidating answers to the questions posed in your other email.

On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > 
> > > Core-iscsi developer seems to be active developing at least the 
> > > new iSCSI target (LIO target).. I think he has been testing it with
> > > core-iscsi, so maybe there's newer version somewhere? 
> > > 
> > > > We did play with the multipath rr_min_io settings and smaller always
> > > > seemed to be better until we got into very large numbers of session.  We
> > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > ports with disktest using 4K blocks to mimic the file system using
> > > > sequential reads (and some sequential writes).
> > > > 
> > > 
> > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > traffic? 
> > > 
> 
> Dunno if you noticed this.. :) 
We are actually quite enthusiastic about the environment and the
project.  We hope to have many of these hosting about 400 VServer guests
running virtual desktops from the X2Go project.  It's not my project but
I don't mind plugging them as I think it is a great technology.

We are using jumbo frames.  The ProCurve 2810 switches explicitly state
to NOT use flow control and jumbo frames simultaneously.  We tried it
anyway but with poor results.
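
(For anyone following along, the Linux-side knobs are roughly these; eth2 is
only a placeholder for an iSCSI-facing NIC, not our actual interface name:)

ifconfig eth2 mtu 9000          # jumbo frames
ethtool -a eth2                 # show current pause/flow control settings
ethtool -A eth2 rx on tx on     # enable flow control instead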
> 
> 
> > > > 
> > > 
> > > When you used dm RAID0 you didn't have any multipath configuration, right? 
> > Correct although we also did test successfully with multipath in
> > failover mode and RAID0.
> > > 
> 
> OK.
> 
> > > What kind of stripe size and other settings you had for RAID0?
> > Chunk size was 8KB with four disks.  
> > > 
> 
> Did you try with much bigger sizes.. 128 kB ?
We tried slightly larger sizes - 16KB and 32KB I believe and observed
performance degradation.  In fact, in some scenarios 4KB chunk sizes
gave us better performance than 8KB.
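
(For reference, and only as a sketch since I haven't shown the exact setup:
with mdadm, which is one of several ways to build the stripe, an 8KB chunk
over four disks would be created like this, device names being placeholders:)

mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=8 /dev/sd[b-e]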
> 
> > > What kind of performance do you get using just a single iscsi session (and
> > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > directly on top of the iscsi /dev/sd? device.
> > Miserable - same roughly 12 MB/s.
> 
> OK, Here's your problem. Was this btw reads or writes? Did you tune
> readahead-settings? 
12MBps is sequential reading but sequential writing is not much
different.  We did tweak readahead to 1024. We did not want to go much
larger in order to maintain balance with the various data patterns -
some of which are random and some of which may not read linearly.
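
(Just to be explicit about the knob, since there are two common ones and I
haven't said which we used; sdX is a placeholder and note the units differ:)

blockdev --setra 1024 /dev/sdX                    # value in 512-byte sectors
echo 1024 > /sys/block/sdX/queue/read_ahead_kb    # value in kB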
> 
> Can paste your iSCSI session settings negotiated with the target? 
Pardon my ignorance :( but, other than packet traces, how do I show the
final negotiated settings?
> 
> > > 
> > > Sounds like there's some other problem if invidual throughput is bad? Or did
> > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > disktest threads it's good.. that would make more sense :) 
> > Yes, the latter.  Single thread (I assume mimicking a single disk
> > operation, e.g., copying a large file) is miserable - much slower than
> > local disk despite the availability of huge bandwidth.  We start
> > utilizing the bandwidth when multiplying concurrent disk activity into
> > the hundreds.
> > 
> > I am guessing the single thread performance problem is an open-iscsi
> > issue but I was hoping multipath would help us work around it by
> > utilizing multiple sessions per disk operation.  I suppose that is where
> > we run into the command ordering problem unless there is something else
> > afoot.  Thanks - John
> 
> You should be able to get many times the throughput you get now.. just with
> a single path/session.
> 
> What kind of latency do you have from the initiator to the target/storage? 
> 
> Try with for example 4 kB ping:
> ping -s 4096 <ip_of_the_iscsi_target>
We have about 400 micro seconds - that seems a bit high :(
rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms

> 
> 1000ms divided by the roundtrip you get from ping should give you maximum
> possible IOPS using a single path.. 
> 
1000 / 0.4 = 2500
> 4 kB * IOPS == max bandwidth you can achieve.
2500 * 4KB = 10 MBps
Hmm . . . seems like what we are getting.  Is that an abnormally high
latency? We have tried playing with interrupt coalescing on the
initiator side but without significant effect.  Thanks for putting
together the formula for me.  Not only does it help me understand but it
means I can work on addressing the latency issue without setting up and
running disk tests.

I would love to use larger block sizes as you suggest in your other
email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
way to change it and would gladly do so if someone knows how.

CFQ was indeed a problem.  It would not scale with increasing the number
of threads.  noop, deadline, and anticipatory all fared much better.  We
are currently using noop for the iSCSI targets.  Thanks again - John
> 
> -- Pasi
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 15:43                     ` John A. Sullivan III
@ 2009-03-24 16:36                       ` Pasi Kärkkäinen
  2009-03-24 17:30                         ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-24 16:36 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, Mar 24, 2009 at 11:43:20AM -0400, John A. Sullivan III wrote:
> I greatly appreciate the help.  I'll answer in the thread below as well
> as consolidating answers to the questions posed in your other email.
> 
> On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> > On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > > 
> > > > Core-iscsi developer seems to be active developing at least the 
> > > > new iSCSI target (LIO target).. I think he has been testing it with
> > > > core-iscsi, so maybe there's newer version somewhere? 
> > > > 
> > > > > We did play with the multipath rr_min_io settings and smaller always
> > > > > seemed to be better until we got into very large numbers of session.  We
> > > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > > ports with disktest using 4K blocks to mimic the file system using
> > > > > sequential reads (and some sequential writes).
> > > > > 
> > > > 
> > > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > > traffic? 
> > > > 
> > 
> > Dunno if you noticed this.. :) 
> We are actually quite enthusiastic about the environment and the
> project.  We hope to have many of these hosting about 400 VServer guests
> running virtual desktops from the X2Go project.  It's not my project but
> I don't mind plugging them as I think it is a great technology.
> 
> We are using jumbo frames.  The ProCurve 2810 switches explicitly state
> to NOT use flow control and jumbo frames simultaneously.  We tried it
> anyway but with poor results.

Ok. 

iirc 2810 does not have very big buffers per port, so you might be better
using flow control instead of jumbos.. then again I'm not sure how good flow
control implementation HP has? 

The whole point of flow control is to prevent packet loss/drop.. this happens
with sending pause frames before the port buffers get full. If port buffers
get full then the switch doesn't have any other option than to drop the
packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
to prevent further packet drops.

flow control "pause frames" cause less delay than tcp-retransmits. 

Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
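
On Linux that boils down to something like the following (the Solaris-side
counter names differ, but the idea is the same):

netstat -s | grep -i retrans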

> > 
> > 
> > > > > 
> > > > 
> > > > When you used dm RAID0 you didn't have any multipath configuration, right? 
> > > Correct although we also did test successfully with multipath in
> > > failover mode and RAID0.
> > > > 
> > 
> > OK.
> > 
> > > > What kind of stripe size and other settings you had for RAID0?
> > > Chunk size was 8KB with four disks.  
> > > > 
> > 
> > Did you try with much bigger sizes.. 128 kB ?
> We tried slightly larger sizes - 16KB and 32KB I believe and observed
> performance degradation.  In fact, in some scenarios 4KB chunk sizes
> gave us better performance than 8KB.

Ok. 

> > 
> > > > What kind of performance do you get using just a single iscsi session (and
> > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > directly on top of the iscsi /dev/sd? device.
> > > Miserable - same roughly 12 MB/s.
> > 
> > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > readahead-settings? 
> 12MBps is sequential reading but sequential writing is not much
> different.  We did tweak readahead to 1024. We did not want to go much
> larger in order to maintain balance with the various data patterns -
> some of which are random and some of which may not read linearly.

I did some benchmarking earlier between two servers; other one running ietd
target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC. 

I remember getting very close to full gigabit speed at least with bigger
block sizes. I can't remember how much I got with 4 kB blocks. 

Those tests were made with dd.

nullio target is a good way to benchmark your network and initiator and
verify everything is correct. 

Also it's good to first test for example with FTP and Iperf to verify
network is working properly between target and the initiator and all the
other basic settings are correct.

Btw have you configured the tcp stacks of the servers? Bigger default tcp
window size, bigger maximum tcp window size, etc.. 
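
On the Linux side a common starting point looks like the following (the
numbers are only examples, not a recommendation for your particular setup):

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"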

> > 
> > Can paste your iSCSI session settings negotiated with the target? 
> Pardon my ignorance :( but, other than packet traces, how do I show the
> final negotiated settings?

Try:

iscsiadm -i -m session
iscsiadm -m session -P3


> > 
> > > > 
> > > > Sounds like there's some other problem if invidual throughput is bad? Or did
> > > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > > disktest threads it's good.. that would make more sense :) 
> > > Yes, the latter.  Single thread (I assume mimicking a single disk
> > > operation, e.g., copying a large file) is miserable - much slower than
> > > local disk despite the availability of huge bandwidth.  We start
> > > utilizing the bandwidth when multiplying concurrent disk activity into
> > > the hundreds.
> > > 
> > > I am guessing the single thread performance problem is an open-iscsi
> > > issue but I was hoping multipath would help us work around it by
> > > utilizing multiple sessions per disk operation.  I suppose that is where
> > > we run into the command ordering problem unless there is something else
> > > afoot.  Thanks - John
> > 
> > You should be able to get many times the throughput you get now.. just with
> > a single path/session.
> > 
> > What kind of latency do you have from the initiator to the target/storage? 
> > 
> > Try with for example 4 kB ping:
> > ping -s 4096 <ip_of_the_iscsi_target>
> We have about 400 micro seconds - that seems a bit high :(
> rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> 

Yeah.. that's a bit high. 

> > 
> > 1000ms divided by the roundtrip you get from ping should give you maximum
> > possible IOPS using a single path.. 
> > 
> 1000 / 0.4 = 2500
> > 4 kB * IOPS == max bandwidth you can achieve.
> 2500 * 4KB = 10 MBps
> Hmm . . . seems like what we are getting.  Is that an abnormally high
> latency? We have tried playing with interrupt coalescing on the
> initiator side but without significant effect.  Thanks for putting
> together the formula for me.  Not only does it help me understand but it
> means I can work on addressing the latency issue without setting up and
> running disk tests.
> 

I think Ross suggested in some other thread the following settings for e1000
NICs:

"Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
and RxRingBufferSize=4096 (verify those option names with a modinfo)
and add those to modprobe.conf."
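
(The ring sizes can also be inspected/changed at runtime with ethtool; eth2
is a placeholder, and note modinfo only lists parameter names, not current
values:)

modinfo e1000 | grep -i parm
ethtool -g eth2                   # show current ring sizes
ethtool -G eth2 rx 4096 tx 4096   # grow the rings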

> I would love to use larger block sizes as you suggest in your other
> email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
> way to change it and would gladly do so if someone knows how.
> 

Are we talking about filesystem block sizes? That shouldn't be a problem if
your application uses larger blocksizes for read/write operations.. 

Try for example with:
dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024

and optionally add "oflag=direct" (or iflag=direct) if you want to make sure 
caches do not mess up the results. 

> CFQ was indeed a problem.  It would not scale with increasing the number
> of threads.  noop, deadline, and anticipatory all fared much better.  We
> are currently using noop for the iSCSI targets.  Thanks again - John

Yep. And no problems.. hopefully I'm able to help and guide to right
direction :)  

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 16:36                       ` Pasi Kärkkäinen
@ 2009-03-24 17:30                         ` John A. Sullivan III
  2009-03-24 18:17                           ` Pasi Kärkkäinen
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-24 17:30 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

Thanks very much, again, and, again, I'll reply in the text - John

On Tue, 2009-03-24 at 18:36 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 11:43:20AM -0400, John A. Sullivan III wrote:
> > I greatly appreciate the help.  I'll answer in the thread below as well
> > as consolidating answers to the questions posed in your other email.
> > 
> > On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> > > On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > > > 
> > > > > Core-iscsi developer seems to be active developing at least the 
> > > > > new iSCSI target (LIO target).. I think he has been testing it with
> > > > > core-iscsi, so maybe there's newer version somewhere? 
> > > > > 
> > > > > > We did play with the multipath rr_min_io settings and smaller always
> > > > > > seemed to be better until we got into very large numbers of session.  We
> > > > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > > > ports with disktest using 4K blocks to mimic the file system using
> > > > > > sequential reads (and some sequential writes).
> > > > > > 
> > > > > 
> > > > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > > > traffic? 
> > > > > 
> > > 
> > > Dunno if you noticed this.. :) 
> > We are actually quite enthusiastic about the environment and the
> > project.  We hope to have many of these hosting about 400 VServer guests
> > running virtual desktops from the X2Go project.  It's not my project but
> > I don't mind plugging them as I think it is a great technology.
> > 
> > We are using jumbo frames.  The ProCurve 2810 switches explicitly state
> > to NOT use flow control and jumbo frames simultaneously.  We tried it
> > anyway but with poor results.
> 
> Ok. 
> 
> iirc 2810 does not have very big buffers per port, so you might be better
> using flow control instead of jumbos.. then again I'm not sure how good flow
> control implementation HP has? 
> 
> The whole point of flow control is to prevent packet loss/drop.. this happens
> with sending pause frames before the port buffers get full. If port buffers
> get full then the switch doesn't have any other option than to drop the
> packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
> to prevent further packet drops.
> 
> flow control "pause frames" cause less delay than tcp-retransmits. 
> 
> Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
Thankfully this is an area of some expertise for me (unlike disk I/O -
obviously ;)  ).  We have been pretty thorough about checking the
network path.  We've not seen any upper layer retransmission or buffer
overflows.
> 
> > > 
> > > 
> > > > > > 
> > > > > 
> > > > > When you used dm RAID0 you didn't have any multipath configuration, right? 
> > > > Correct although we also did test successfully with multipath in
> > > > failover mode and RAID0.
> > > > > 
> > > 
> > > OK.
> > > 
> > > > > What kind of stripe size and other settings you had for RAID0?
> > > > Chunk size was 8KB with four disks.  
> > > > > 
> > > 
> > > Did you try with much bigger sizes.. 128 kB ?
> > We tried slightly larger sizes - 16KB and 32KB I believe and observed
> > performance degradation.  In fact, in some scenarios 4KB chunk sizes
> > gave us better performance than 8KB.
> 
> Ok. 
> 
> > > 
> > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > directly on top of the iscsi /dev/sd? device.
> > > > Miserable - same roughly 12 MB/s.
> > > 
> > > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > > readahead-settings? 
> > 12MBps is sequential reading but sequential writing is not much
> > different.  We did tweak readahead to 1024. We did not want to go much
> > larger in order to maintain balance with the various data patterns -
> > some of which are random and some of which may not read linearly.
> 
> I did some benchmarking earlier between two servers; other one running ietd
> target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC. 
> 
> I remember getting very close to full gigabit speed at least with bigger
> block sizes. I can't remember how much I got with 4 kB blocks. 
> 
> Those tests were made with dd.
Yes, if we use 64KB blocks, we can saturate a Gig link.  With larger
sizes, we can push over 3 Gbps over the four gig links in the test
environment.
> 
> nullio target is a good way to benchmark your network and initiator and
> verify everything is correct. 
> 
> Also it's good to first test for example with FTP and Iperf to verify
> network is working properly between target and the initiator and all the
> other basic settings are correct.
We did flood ping the network and had all interfaces operating at near
capacity.  The network itself looks very healthy.
> 
> Btw have you configured tcp stacks of the servers? Bigger default tcp window
> size, bigger maximun tcp window size etc.. 
Yep, tweaked transmit queue length, receive and transmit windows, net
device backlogs, buffer space, disabled nagle, and even played with the
dirty page watermarks.
> 
> > > 
> > > Can paste your iSCSI session settings negotiated with the target? 
> > Pardon my ignorance :( but, other than packet traces, how do I show the
> > final negotiated settings?
> 
> Try:
> 
> iscsiadm -i -m session
> iscsiadm -m session -P3
> 
Here's what it says.  Pretty much as expected.  We are using COMSTAR on
the target and took some traces to see what COMSTAR was expecting. We
set the open-iscsi parameters to match:

Current Portal: 172.x.x.174:3260,2
        Persistent Portal: 172.x.x.174:3260,2
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
                Iface IPaddress: 172.x.x.162
                Iface HWaddress: default
                Iface Netdev: default
                SID: 32
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE
                ************************
                Negotiated iSCSI params:
                ************************
                HeaderDigest: None
                DataDigest: None
                MaxRecvDataSegmentLength: 131072
                MaxXmitDataSegmentLength: 8192
                FirstBurstLength: 65536
                MaxBurstLength: 524288
                ImmediateData: Yes
                InitialR2T: Yes
                MaxOutstandingR2T: 1
                ************************
                Attached SCSI devices:
                ************************
                Host Number: 39 State: running
                scsi39 Channel 00 Id 0 Lun: 0
                        Attached scsi disk sdah         State: running

> 
> > > 
> > > > > 
> > > > > Sounds like there's some other problem if invidual throughput is bad? Or did
> > > > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > > > disktest threads it's good.. that would make more sense :) 
> > > > Yes, the latter.  Single thread (I assume mimicking a single disk
> > > > operation, e.g., copying a large file) is miserable - much slower than
> > > > local disk despite the availability of huge bandwidth.  We start
> > > > utilizing the bandwidth when multiplying concurrent disk activity into
> > > > the hundreds.
> > > > 
> > > > I am guessing the single thread performance problem is an open-iscsi
> > > > issue but I was hoping multipath would help us work around it by
> > > > utilizing multiple sessions per disk operation.  I suppose that is where
> > > > we run into the command ordering problem unless there is something else
> > > > afoot.  Thanks - John
> > > 
> > > You should be able to get many times the throughput you get now.. just with
> > > a single path/session.
> > > 
> > > What kind of latency do you have from the initiator to the target/storage? 
> > > 
> > > Try with for example 4 kB ping:
> > > ping -s 4096 <ip_of_the_iscsi_target>
> > We have about 400 micro seconds - that seems a bit high :(
> > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > 
> 
> Yeah.. that's a bit high. 
Actually, with more testing, we're seeing it stretch up to over 700
micro-seconds.  I'll attach a raft of data I collected at the end of
this email.
> 
> > > 
> > > 1000ms divided by the roundtrip you get from ping should give you maximum
> > > possible IOPS using a single path.. 
> > > 
> > 1000 / 0.4 = 2500
> > > 4 kB * IOPS == max bandwidth you can achieve.
> > 2500 * 4KB = 10 MBps
> > Hmm . . . seems like what we are getting.  Is that an abnormally high
> > latency? We have tried playing with interrupt coalescing on the
> > initiator side but without significant effect.  Thanks for putting
> > together the formula for me.  Not only does it help me understand but it
> > means I can work on addressing the latency issue without setting up and
> > running disk tests.
> > 
> 
> I think Ross suggested in some other thread the following settings for e1000
> NICs:
> 
> "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> and RxRingBufferSize=4096 (verify those option names with a modinfo)
> and add those to modprobe.conf."
We did try playing with the ring buffer but to no avail.  Modinfo does
not seem to display the current settings.  We did try playing with
setting the InterruptThrottleRate to 1 but again to no avail.  As I'll
mention later, I suspect the issue might be the opensolaris based
target.
> 
> > I would love to use larger block sizes as you suggest in your other
> > email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
> > way to change it and would gladly do so if someone knows how.
> > 
> 
> Are we talking about filesystem block sizes? That shouldn't be a problem if
> your application uses larger blocksizes for read/write operations.. 
> 
Yes, file system block size.  When we try rough, end user style tests,
e.g., large file copies, we seem to get the performance indicated by 4KB
blocks, i.e., lousy!
> Try for example with:
> dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
Large block sizes can make the system truly fly so we suspect you are
absolutely correct about latency being the issue.  We did do our testing
with raw interfaces by the way.
> 
> and optionally add "oflag=direct" (or iflag=direct) if you want to make sure 
> caches do not mess up the results. 
> 
> > CFQ was indeed a problem.  It would not scale with increasing the number
> > of threads.  noop, deadline, and anticipatory all fared much better.  We
> > are currently using noop for the iSCSI targets.  Thanks again - John
> 
> Yep. And no problems.. hopefully I'm able to help and guide to right
> direction :)  
<snip>
I did a little digging and calculating and here is what I came up with
and sent to Nexenta.  Please tell me if I am on the right track.

I am using jumbo frames and should be able to get 2 4KB blocks
per frame.  Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC
-oops we need to add iSCSI -what size is the iSCSI header?) + 12
(interframe gap) = 8282 bytes.  Transmission latency should be 8282 *
8 / 1,000,000,000 = 66.3 micro-seconds.  Switch latency is 5.7
microseconds so let's say network latency is 72 - well let's say 75
micro-seconds.  The only additional latency should be added by the
network stacks on the target and initiator.

Current round trip latency between the initiator (Linux) and target
(Nexenta) is around 400 micro-seconds and fluctuates significantly:

Hmm . .  this is worse than the last test:
PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.
8200 bytes from 172.30.13.158: icmp_seq=1 ttl=255 time=1.36 ms
8200 bytes from 172.30.13.158: icmp_seq=2 ttl=255 time=0.638 ms
8200 bytes from 172.30.13.158: icmp_seq=3 ttl=255 time=0.622 ms
8200 bytes from 172.30.13.158: icmp_seq=4 ttl=255 time=0.603 ms
8200 bytes from 172.30.13.158: icmp_seq=5 ttl=255 time=0.586 ms
8200 bytes from 172.30.13.158: icmp_seq=6 ttl=255 time=0.564 ms
8200 bytes from 172.30.13.158: icmp_seq=7 ttl=255 time=0.553 ms
8200 bytes from 172.30.13.158: icmp_seq=8 ttl=255 time=0.525 ms
8200 bytes from 172.30.13.158: icmp_seq=9 ttl=255 time=0.508 ms
8200 bytes from 172.30.13.158: icmp_seq=10 ttl=255 time=0.490 ms
8200 bytes from 172.30.13.158: icmp_seq=11 ttl=255 time=0.472 ms
8200 bytes from 172.30.13.158: icmp_seq=12 ttl=255 time=0.454 ms
8200 bytes from 172.30.13.158: icmp_seq=13 ttl=255 time=0.436 ms
8200 bytes from 172.30.13.158: icmp_seq=14 ttl=255 time=0.674 ms
8200 bytes from 172.30.13.158: icmp_seq=15 ttl=255 time=0.399 ms
8200 bytes from 172.30.13.158: icmp_seq=16 ttl=255 time=0.638 ms
8200 bytes from 172.30.13.158: icmp_seq=17 ttl=255 time=0.620 ms
8200 bytes from 172.30.13.158: icmp_seq=18 ttl=255 time=0.601 ms
8200 bytes from 172.30.13.158: icmp_seq=19 ttl=255 time=0.583 ms
8200 bytes from 172.30.13.158: icmp_seq=20 ttl=255 time=0.563 ms
8200 bytes from 172.30.13.158: icmp_seq=21 ttl=255 time=0.546 ms
8200 bytes from 172.30.13.158: icmp_seq=22 ttl=255 time=0.518 ms
8200 bytes from 172.30.13.158: icmp_seq=23 ttl=255 time=0.501 ms
8200 bytes from 172.30.13.158: icmp_seq=24 ttl=255 time=0.481 ms
8200 bytes from 172.30.13.158: icmp_seq=25 ttl=255 time=0.463 ms
8200 bytes from 172.30.13.158: icmp_seq=26 ttl=255 time=0.443 ms
8200 bytes from 172.30.13.158: icmp_seq=27 ttl=255 time=0.682 ms
8200 bytes from 172.30.13.158: icmp_seq=28 ttl=255 time=0.404 ms
8200 bytes from 172.30.13.158: icmp_seq=29 ttl=255 time=0.644 ms
8200 bytes from 172.30.13.158: icmp_seq=30 ttl=255 time=0.624 ms
8200 bytes from 172.30.13.158: icmp_seq=31 ttl=255 time=0.605 ms
8200 bytes from 172.30.13.158: icmp_seq=32 ttl=255 time=0.586 ms
8200 bytes from 172.30.13.158: icmp_seq=33 ttl=255 time=0.566 ms
^C
--- 172.30.13.158 ping statistics ---
33 packets transmitted, 33 received, 0% packet loss, time 32000ms
rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms

There is nothing going on in the network.  So we are seeing 574
micro-seconds total with only 150 micro-seconds attributed to
transmission.  And we see a wide variation in latency.

I then tested the latency between interfaces on the initiator and the
target.  Here is what I get for internal latency on the Linux initiator:
PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes
of data.
8200 bytes from 172.30.13.18: icmp_seq=1 ttl=64 time=0.033 ms
8200 bytes from 172.30.13.18: icmp_seq=2 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=3 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=4 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=5 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=6 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=7 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=8 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=9 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=10 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=11 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=12 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=13 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=14 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=15 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=16 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=17 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=18 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=19 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=20 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=21 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=22 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=23 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=24 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=25 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=26 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=27 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=28 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=29 ttl=64 time=0.018 ms
^C
--- 172.30.13.18 ping statistics ---
29 packets transmitted, 29 received, 0% packet loss, time 27999ms
rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms

A very consistent 18 micro-seconds.

Here is what I get from the Z200:
root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
PING 172.30.13.190: 4096 data bytes
4104 bytes from 172.30.13.190: icmp_seq=0. time=0.104 ms
4104 bytes from 172.30.13.190: icmp_seq=1. time=0.081 ms
4104 bytes from 172.30.13.190: icmp_seq=2. time=0.067 ms
4104 bytes from 172.30.13.190: icmp_seq=3. time=0.083 ms
4104 bytes from 172.30.13.190: icmp_seq=4. time=0.097 ms
4104 bytes from 172.30.13.190: icmp_seq=5. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=6. time=0.048 ms
4104 bytes from 172.30.13.190: icmp_seq=7. time=0.050 ms
4104 bytes from 172.30.13.190: icmp_seq=8. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=9. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=10. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=11. time=0.042 ms
4104 bytes from 172.30.13.190: icmp_seq=12. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=13. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=14. time=0.042 ms
4104 bytes from 172.30.13.190: icmp_seq=15. time=0.047 ms
4104 bytes from 172.30.13.190: icmp_seq=16. time=0.072 ms
4104 bytes from 172.30.13.190: icmp_seq=17. time=0.080 ms
4104 bytes from 172.30.13.190: icmp_seq=18. time=0.070 ms
4104 bytes from 172.30.13.190: icmp_seq=19. time=0.066 ms
4104 bytes from 172.30.13.190: icmp_seq=20. time=0.086 ms
4104 bytes from 172.30.13.190: icmp_seq=21. time=0.068 ms
4104 bytes from 172.30.13.190: icmp_seq=22. time=0.079 ms
4104 bytes from 172.30.13.190: icmp_seq=23. time=0.068 ms
4104 bytes from 172.30.13.190: icmp_seq=24. time=0.069 ms
4104 bytes from 172.30.13.190: icmp_seq=25. time=0.070 ms
4104 bytes from 172.30.13.190: icmp_seq=26. time=0.095 ms
4104 bytes from 172.30.13.190: icmp_seq=27. time=0.095 ms
4104 bytes from 172.30.13.190: icmp_seq=28. time=0.073 ms
4104 bytes from 172.30.13.190: icmp_seq=29. time=0.071 ms
4104 bytes from 172.30.13.190: icmp_seq=30. time=0.071 ms
^C
----172.30.13.190 PING Statistics----
31 packets transmitted, 31 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 0.042/0.066/0.104/0.019

Notice the latency is several times longer, with much wider variation.
How do we tune the opensolaris network stack to reduce its latency? I'd
really like to improve the individual user experience.  I can tell them
it's like commuting to work on the train instead of the car during rush
hour - faster when there's lots of traffic but slower when there is not,
but they will judge the product by their individual experiences more
than their collective experiences.  Thus, I really want to improve the
individual disk operation throughput.

Latency seems to be our key.  If I can add only 20 micro-seconds of
latency from initiator and target each, that would be roughly 200 micro
seconds.  That would almost triple the throughput from what we are
currently seeing.

Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
I can certainly learn but am I headed in the right direction or is this
direction of investigation misguided? Thanks - John

-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 17:30                         ` John A. Sullivan III
@ 2009-03-24 18:17                           ` Pasi Kärkkäinen
  2009-03-25  3:41                             ` John A. Sullivan III
  2009-03-25  3:44                             ` John A. Sullivan III
  0 siblings, 2 replies; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-24 18:17 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, Mar 24, 2009 at 01:30:10PM -0400, John A. Sullivan III wrote:
> Thanks very much, again, and, again, I'll reply in the text - John
> 

Np :)

> > 
> > iirc 2810 does not have very big buffers per port, so you might be better
> > using flow control instead of jumbos.. then again I'm not sure how good flow
> > control implementation HP has? 
> > 
> > The whole point of flow control is to prevent packet loss/drop.. this happens
> > with sending pause frames before the port buffers get full. If port buffers
> > get full then the switch doesn't have any other option than to drop the
> > packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
> > to prevent further packet drops.
> > 
> > flow control "pause frames" cause less delay than tcp-retransmits. 
> > 
> > Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
> Thankfully this is an area of some expertise for me (unlike disk I/O -
> obviously ;)  ).  We have been pretty thorough about checking the
> network path.  We've not seen any upper layer retransmission or buffer
> overflows.

Good :)

> > > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > > directly on top of the iscsi /dev/sd? device.
> > > > > Miserable - same roughly 12 MB/s.
> > > > 
> > > > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > > > readahead-settings? 
> > > 12MBps is sequential reading but sequential writing is not much
> > > different.  We did tweak readahead to 1024. We did not want to go much
> > > larger in order to maintain balance with the various data patterns -
> > > some of which are random and some of which may not read linearly.
> > 
> > I did some benchmarking earlier between two servers; other one running ietd
> > target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC. 
> > 
> > I remember getting very close to full gigabit speed at least with bigger
> > block sizes. I can't remember how much I got with 4 kB blocks. 
> > 
> > Those tests were made with dd.
> Yes, if we use 64KB blocks, we can saturate a Gig link.  With larger
> sizes, we can push over 3 Gpbs over the four gig links in the test
> environment.

That's good. 

> > 
> > nullio target is a good way to benchmark your network and initiator and
> > verify everything is correct. 
> > 
> > Also it's good to first test for example with FTP and Iperf to verify
> > network is working properly between target and the initiator and all the
> > other basic settings are correct.
> We did flood ping the network and had all interfaces operating at near
> capacity.  The network itself looks very healthy.

Ok. 

> > 
> > Btw have you configured tcp stacks of the servers? Bigger default tcp window
> > size, bigger maximun tcp window size etc.. 
> Yep, tweaked transmit queue length, receive and transmit windows, net
> device backlogs, buffer space, disabled nagle, and even played with the
> dirty page watermarks.

That's all taken care of then :) 

Also on the target? 

> > 
> > > > 
> > > > Can paste your iSCSI session settings negotiated with the target? 
> > > Pardon my ignorance :( but, other than packet traces, how do I show the
> > > final negotiated settings?
> > 
> > Try:
> > 
> > iscsiadm -i -m session
> > iscsiadm -m session -P3
> > 
> Here's what it says.  Pretty much as expected.  We are using COMSTAR on
> the target and took some traces to see what COMSTAR was expecting. We
> set the open-iscsi parameters to match:
> 
> Current Portal: 172.x.x.174:3260,2
>         Persistent Portal: 172.x.x.174:3260,2
>                 **********
>                 Interface:
>                 **********
>                 Iface Name: default
>                 Iface Transport: tcp
>                 Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
>                 Iface IPaddress: 172.x.x.162
>                 Iface HWaddress: default
>                 Iface Netdev: default
>                 SID: 32
>                 iSCSI Connection State: LOGGED IN
>                 iSCSI Session State: LOGGED_IN
>                 Internal iscsid Session State: NO CHANGE
>                 ************************
>                 Negotiated iSCSI params:
>                 ************************
>                 HeaderDigest: None
>                 DataDigest: None
>                 MaxRecvDataSegmentLength: 131072
>                 MaxXmitDataSegmentLength: 8192
>                 FirstBurstLength: 65536
>                 MaxBurstLength: 524288
>                 ImmediateData: Yes
>                 InitialR2T: Yes

I guess InitialR2T could be No for a bit better performance? 

MaxXmitDataSegmentLength looks small? 
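
On the open-iscsi side InitialR2T is set in /etc/iscsi/iscsid.conf or per
node, for example (target/portal are placeholders, and note the transmit
segment size is capped by whatever MaxRecvDataSegmentLength the COMSTAR
target advertises, so that one would have to change on the target side):

node.session.iscsi.InitialR2T = No

iscsiadm -m node -T <target_iqn> -p <portal> --op update \
        -n node.session.iscsi.InitialR2T -v No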

> > > > You should be able to get many times the throughput you get now.. just with
> > > > a single path/session.
> > > > 
> > > > What kind of latency do you have from the initiator to the target/storage? 
> > > > 
> > > > Try with for example 4 kB ping:
> > > > ping -s 4096 <ip_of_the_iscsi_target>
> > > We have about 400 micro seconds - that seems a bit high :(
> > > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > > 
> > 
> > Yeah.. that's a bit high. 
> Actually, with more testing, we're seeing it stretch up to over 700
> micro-seconds.  I'll attach a raft of data I collected at the end of
> this email.

Ok.

> > I think Ross suggested in some other thread the following settings for e1000
> > NICs:
> > 
> > "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> > and RxRingBufferSize=4096 (verify those option names with a modinfo)
> > and add those to modprobe.conf."
> We did try playing with the ring buffer but to no avail.  Modinfo does
> not seem to display the current settings.  We did try playing with
> setting the InterruptThrottleRate to 1 but again to no avail.  As I'll
> mention later, I suspect the issue might be the opensolaris based
> target.

Could be..

> > 
> > > I would love to use larger block sizes as you suggest in your other
> > > email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
> > > way to change it and would gladly do so if someone knows how.
> > > 
> > 
> > Are we talking about filesystem block sizes? That shouldn't be a problem if
> > your application uses larger blocksizes for read/write operations.. 
> > 
> Yes, file system block size.  When we try rough, end user style tests,
> e.g., large file copies, we seem to get the performance indicated by 4KB
> blocks, i.e., lousy!

Yep.. try upgrading to 10 Gbit Ethernet for much lower latency ;)

> > Try for example with:
> > dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
> Large block sizes can make the system truly fly so we suspect you are
> absolutely correct about latency being the issue.  We did do our testing
> with raw interfaces by the way.

Ok.

> <snip>
> I did a little digging and calculating and here is what I came up with
> and sent to Nexenta.  Please tell me if I am on the right track.
> 
> I am using jumbo frames and should be able to get 2 4KB blocks
> per frame.  Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC
> -oops we need to add iSCSI -what size is the iSCSI header?) + 12
> (interframe gap) = 8282 bytes.  Transmission latency should be 8282 *
> 8 / 1,000,000,000 = 66.3 micro-seconds.  Switch latency is 5.7
> microseconds so let's say network latency is 72 - well let's say 75
> micro-seconds.  The only additional latency should be added by the
> network stacks on the target and initiator.
> 
> Current round trip latency between the initiator (Linux) and target
> (Nexenta) is around 400 micro-seconds and fluctuates significantly:
> 
> Hmm . .  this is worse than the last test:
> PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.

> --- 172.30.13.158 ping statistics ---
> 33 packets transmitted, 33 received, 0% packet loss, time 32000ms
> rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms
> 
> There is nothing going on in the network.  So we are seeing 574
> micro-seconds total with only 150 micro-seconds attributed to
> transmission.  And we see a wide variation in latency.
>

Yeah something wrong there.. How much latency do you have between different
initiator machines? 
 
> I then tested the latency between interfaces on the initiator and the
> target.  Here is what I get for internal latency on the Linux initiator:
> PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes
> of data.
> --- 172.30.13.18 ping statistics ---
> 29 packets transmitted, 29 received, 0% packet loss, time 27999ms
> rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms
> 
> A very consistent 18 micro-seconds.
> 

Yeah, I take it that's not through network/switch :) 

> Here is what I get from the Z200:
> root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
> PING 172.30.13.190: 4096 data bytes
> ----172.30.13.190 PING Statistics----
> 31 packets transmitted, 31 packets received, 0% packet loss
> round-trip (ms)  min/avg/max/stddev = 0.042/0.066/0.104/0.019
> 

Big difference.. I'm not familiar with Solaris, so can't really suggest what
to tune there.. 

> Notice it is several times longer latency with much wider variation.
> How to we tune the opensolaris network stack to reduce it's latency? I'd
> really like to improve the individual user experience.  I can tell them
> it's like commuting to work on the train instead of the car during rush
> hour - faster when there's lots of traffic but slower when there is not,
> but they will judge the product by their individual experiences more
> than their collective experiences.  Thus, I really want to improve the
> individual disk operation throughput.
> 
> Latency seems to be our key.  If I can add only 20 micro-seconds of
> latency from initiator and target each, that would be roughly 200 micro
> seconds.  That would almost triple the throughput from what we are
> currently seeing.
> 

Indeed :) 

> Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> I can certainly learn but am I headed in the right direction or is this
> direction of investigation misguided? Thanks - John
> 

Low latency is the key for good (iSCSI) SAN performance, as it directly
gives you more (possible) IOPS. 

Another option is to configure the software/settings so that there are
multiple outstanding IOs in flight.. then you're not limited by the latency
(as much).
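
As a sketch of the second option, fio is one convenient way to keep several
IOs outstanding (assuming it is installed; /dev/sdX and the numbers below
are only placeholders):

fio --name=qd32 --filename=/dev/sdX --rw=read --bs=4k --direct=1 \
    --ioengine=libaio --iodepth=32 --runtime=30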

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 18:17                           ` Pasi Kärkkäinen
@ 2009-03-25  3:41                             ` John A. Sullivan III
  2009-03-25 15:52                               ` Pasi Kärkkäinen
  2009-03-25  3:44                             ` John A. Sullivan III
  1 sibling, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-25  3:41 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, 2009-03-24 at 20:17 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 01:30:10PM -0400, John A. Sullivan III wrote:
> > Thanks very much, again, and, again, I'll reply in the text - John
> > 
> 
> Np :)
> 
> > > 
> > > iirc 2810 does not have very big buffers per port, so you might be better
> > > using flow control instead of jumbos.. then again I'm not sure how good flow
> > > control implementation HP has? 
> > > 
> > > The whole point of flow control is to prevent packet loss/drop.. this happens
> > > with sending pause frames before the port buffers get full. If port buffers
> > > get full then the switch doesn't have any other option than to drop the
> > > packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
> > > to prevent further packet drops.
> > > 
> > > flow control "pause frames" cause less delay than tcp-retransmits. 
> > > 
> > > Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
> > Thankfully this is an area of some expertise for me (unlike disk I/O -
> > obviously ;)  ).  We have been pretty thorough about checking the
> > network path.  We've not seen any upper layer retransmission or buffer
> > overflows.
> 
> Good :)
> 
> > > > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > > > directly on top of the iscsi /dev/sd? device.
> > > > > > Miserable - same roughly 12 MB/s.
> > > > > 
> > > > > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > > > > readahead-settings? 
> > > > 12MBps is sequential reading but sequential writing is not much
> > > > different.  We did tweak readahead to 1024. We did not want to go much
> > > > larger in order to maintain balance with the various data patterns -
> > > > some of which are random and some of which may not read linearly.
> > > 
> > > I did some benchmarking earlier between two servers; other one running ietd
> > > target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC. 
> > > 
> > > I remember getting very close to full gigabit speed at least with bigger
> > > block sizes. I can't remember how much I got with 4 kB blocks. 
> > > 
> > > Those tests were made with dd.
> > Yes, if we use 64KB blocks, we can saturate a Gig link.  With larger
> > sizes, we can push over 3 Gpbs over the four gig links in the test
> > environment.
> 
> That's good. 
> 
> > > 
> > > nullio target is a good way to benchmark your network and initiator and
> > > verify everything is correct. 
> > > 
> > > Also it's good to first test for example with FTP and Iperf to verify
> > > network is working properly between target and the initiator and all the
> > > other basic settings are correct.
> > We did flood ping the network and had all interfaces operating at near
> > capacity.  The network itself looks very healthy.
> 
> Ok. 
> 
> > > 
> > > Btw have you configured tcp stacks of the servers? Bigger default tcp window
> > > size, bigger maximun tcp window size etc.. 
> > Yep, tweaked transmit queue length, receive and transmit windows, net
> > device backlogs, buffer space, disabled nagle, and even played with the
> > dirty page watermarks.
> 
> That's all taken care of then :) 
> 
> Also on the target? 
> 
> > > 
> > > > > 
> > > > > Can paste your iSCSI session settings negotiated with the target? 
> > > > Pardon my ignorance :( but, other than packet traces, how do I show the
> > > > final negotiated settings?
> > > 
> > > Try:
> > > 
> > > iscsiadm -i -m session
> > > iscsiadm -m session -P3
> > > 
> > Here's what it says.  Pretty much as expected.  We are using COMSTAR on
> > the target and took some traces to see what COMSTAR was expecting. We
> > set the open-iscsi parameters to match:
> > 
> > Current Portal: 172.x.x.174:3260,2
> >         Persistent Portal: 172.x.x.174:3260,2
> >                 **********
> >                 Interface:
> >                 **********
> >                 Iface Name: default
> >                 Iface Transport: tcp
> >                 Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
> >                 Iface IPaddress: 172.x.x.162
> >                 Iface HWaddress: default
> >                 Iface Netdev: default
> >                 SID: 32
> >                 iSCSI Connection State: LOGGED IN
> >                 iSCSI Session State: LOGGED_IN
> >                 Internal iscsid Session State: NO CHANGE
> >                 ************************
> >                 Negotiated iSCSI params:
> >                 ************************
> >                 HeaderDigest: None
> >                 DataDigest: None
> >                 MaxRecvDataSegmentLength: 131072
> >                 MaxXmitDataSegmentLength: 8192
> >                 FirstBurstLength: 65536
> >                 MaxBurstLength: 524288
> >                 ImmediateData: Yes
> >                 InitialR2T: Yes
> 
> I guess InitialR2T could be No for a bit better performance? 
> 
> MaxXmitDataSegmentLength looks small? 
> 
> > > > > You should be able to get many times the throughput you get now.. just with
> > > > > a single path/session.
> > > > > 
> > > > > What kind of latency do you have from the initiator to the target/storage? 
> > > > > 
> > > > > Try with for example 4 kB ping:
> > > > > ping -s 4096 <ip_of_the_iscsi_target>
> > > > We have about 400 micro seconds - that seems a bit high :(
> > > > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > > > 
> > > 
> > > Yeah.. that's a bit high. 
> > Actually, with more testing, we're seeing it stretch up to over 700
> > micro-seconds.  I'll attach a raft of data I collected at the end of
> > this email.
> 
> Ok.
> 
> > > I think Ross suggested in some other thread the following settings for e1000
> > > NICs:
> > > 
> > > "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> > > and RxRingBufferSize=4096 (verify those option names with a modinfo)
> > > and add those to modprobe.conf."
> > We did try playing with the ring buffer but to no avail.  Modinfo does
> > not seem to display the current settings.  We did try playing with
> > setting the InterruptThrottleRate to 1 but again to no avail.  As I'll
> > mention later, I suspect the issue might be the opensolaris based
> > target.
> 
> Could be..
> 
> > > 
> > > > I would love to use larger block sizes as you suggest in your other
> > > > email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
> > > > way to change it and would gladly do so if someone knows how.
> > > > 
> > > 
> > > Are we talking about filesystem block sizes? That shouldn't be a problem if
> > > your application uses larger blocksizes for read/write operations.. 
> > > 
> > Yes, file system block size.  When we try rough, end user style tests,
> > e.g., large file copies, we seem to get the performance indicated by 4KB
> > blocks, i.e., lousy!
> 
> Yep.. try upgrading to 10 Gbit Ethernet for much lower latency ;)
> 
> > > Try for example with:
> > > dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
> > Large block sizes can make the system truly fly so we suspect you are
> > absolutely correct about latency being the issue.  We did do our testing
> > with raw interfaces by the way.
> 
> Ok.
> 
> > <snip>
> > I did a little digging and calculating and here is what I came up with
> > and sent to Nexenta.  Please tell me if I am on the right track.
> > 
> > I am using jumbo frames and should be able to get 2 4KB blocks
> > per frame.  Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC
> > -oops we need to add iSCSI -what size is the iSCSI header?) + 12
> > (interframe gap) = 8282 bytes.  Transmission latency should be 8282 *
> > 8 / 1,000,000,000 = 66.3 micro-seconds.  Switch latency is 5.7
> > microseconds so let's say network latency is 72 - well let's say 75
> > micro-seconds.  The only additional latency should be added by the
> > network stacks on the target and initiator.
> > 
> > Current round trip latency between the initiator (Linux) and target
> > (Nexenta) is around 400 micro-seconds and fluctuates significantly:
> > 
> > Hmm . .  this is worse than the last test:
> > PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.
> 
> > --- 172.30.13.158 ping statistics ---
> > 33 packets transmitted, 33 received, 0% packet loss, time 32000ms
> > rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms
> > 
> > There is nothing going on in the network.  So we are seeing 574
> > micro-seconds total with only 150 micro-seconds attributed to
> > transmission.  And we see a wide variation in latency.
> >
> 
> Yeah something wrong there.. How much latency do you have between different
> initiator machines? 
>  
> > I then tested the latency between interfaces on the initiator and the
> > target.  Here is what I get for internal latency on the Linux initiator:
> > PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes
> > of data.
> > --- 172.30.13.18 ping statistics ---
> > 29 packets transmitted, 29 received, 0% packet loss, time 27999ms
> > rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms
> > 
> > A very consistent 18 micro-seconds.
> > 
> 
> Yeah, I take it that's not through network/switch :) 
> 
> > Here is what I get from the Z200:
> > root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
> > PING 172.30.13.190: 4096 data bytes
> > ----172.30.13.190 PING Statistics----
> > 31 packets transmitted, 31 packets received, 0% packet loss
> > round-trip (ms)  min/avg/max/stddev = 0.042/0.066/0.104/0.019
> > 
> 
> Big difference.. I'm not familiar with Solaris, so can't really suggest what
> to tune there.. 
> 
> > Notice it is several times longer latency with much wider variation.
> > How do we tune the opensolaris network stack to reduce its latency? I'd
> > really like to improve the individual user experience.  I can tell them
> > it's like commuting to work on the train instead of the car during rush
> > hour - faster when there's lots of traffic but slower when there is not,
> > but they will judge the product by their individual experiences more
> > than their collective experiences.  Thus, I really want to improve the
> > individual disk operation throughput.
> > 
> > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > latency from initiator and target each, that would be roughly 200 micro
> > seconds.  That would almost triple the throughput from what we are
> > currently seeing.
> > 
> 
> Indeed :) 
> 
> > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > I can certainly learn but am I headed in the right direction or is this
> > direction of investigation misguided? Thanks - John
> > 
> 
> Low latency is the key for good (iSCSI) SAN performance, as it directly
> gives you more (possible) IOPS. 
> 
> Other option is to configure software/settings so that there are multiple
> outstanding IO's on the fly.. then you're not limited with the latency (so much).
> 
> -- Pasi
<snip>
Ross has been of enormous help offline.  Indeed, disabling jumbo packets
produced an almost 50% increase in single threaded throughput.  We are
pretty well set although still a bit disappointed in the latency we are
seeing in opensolaris and have escalated to the vendor about addressing
it.
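
In case the detail helps anyone else, dropping back to standard frames
was just a matter of resetting the MTU on the iSCSI-facing interfaces
(and on the matching target and switch ports); eth2 below is only an
example name:

  ip link set dev eth2 mtu 1500     # or: ifconfig eth2 mtu 1500

with the persistent MTU= line adjusted in the distribution's ifcfg file.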

The one piece which is still a mystery is why using four targets on
four separate interfaces striped with dmadm RAID0 does not produce an
aggregate of slightly less than four times the IOPS of a single target
on a single interface.  This would not seem to be the out-of-order SCSI
command problem of multipath.  One of life's great mysteries yet to be
revealed.  Thanks again, all - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-24 18:17                           ` Pasi Kärkkäinen
  2009-03-25  3:41                             ` John A. Sullivan III
@ 2009-03-25  3:44                             ` John A. Sullivan III
  2009-03-25 15:52                               ` Pasi Kärkkäinen
  1 sibling, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-25  3:44 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, 2009-03-24 at 20:17 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 01:30:10PM -0400, John A. Sullivan III wrote:
> > Thanks very much, again, and, again, I'll reply in the text - John
> > 
> 
> Np :)
> 
> > > 
> > > iirc 2810 does not have very big buffers per port, so you might be better
> > > using flow control instead of jumbos.. then again I'm not sure how good flow
> > > control implementation HP has? 
> > > 
> > > The whole point of flow control is to prevent packet loss/drop.. this happens
> > > with sending pause frames before the port buffers get full. If port buffers
> > > get full then the switch doesn't have any other option than to drop the
> > > packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
> > > to prevent further packet drops.
> > > 
> > > flow control "pause frames" cause less delay than tcp-retransmits. 
> > > 
> > > Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
> > Thankfully this is an area of some expertise for me (unlike disk I/O -
> > obviously ;)  ).  We have been pretty thorough about checking the
> > network path.  We've not seen any upper layer retransmission or buffer
> > overflows.
> 
> Good :)
> 
> > > > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > > > directly on top of the iscsi /dev/sd? device.
> > > > > > Miserable - same roughly 12 MB/s.
> > > > > 
> > > > > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > > > > readahead-settings? 
> > > > 12MBps is sequential reading but sequential writing is not much
> > > > different.  We did tweak readahead to 1024. We did not want to go much
> > > > larger in order to maintain balance with the various data patterns -
> > > > some of which are random and some of which may not read linearly.
> > > 
> > > I did some benchmarking earlier between two servers; other one running ietd
> > > target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC. 
> > > 
> > > I remember getting very close to full gigabit speed at least with bigger
> > > block sizes. I can't remember how much I got with 4 kB blocks. 
> > > 
> > > Those tests were made with dd.
> > Yes, if we use 64KB blocks, we can saturate a Gig link.  With larger
> > sizes, we can push over 3 Gbps over the four gig links in the test
> > environment.
> 
> That's good. 
> 
> > > 
> > > nullio target is a good way to benchmark your network and initiator and
> > > verify everything is correct. 
> > > 
> > > Also it's good to first test for example with FTP and Iperf to verify
> > > network is working properly between target and the initiator and all the
> > > other basic settings are correct.
> > We did flood ping the network and had all interfaces operating at near
> > capacity.  The network itself looks very healthy.
> 
> Ok. 
> 
> > > 
> > > Btw have you configured tcp stacks of the servers? Bigger default tcp window
> > > size, bigger maximun tcp window size etc.. 
> > Yep, tweaked transmit queue length, receive and transmit windows, net
> > device backlogs, buffer space, disabled nagle, and even played with the
> > dirty page watermarks.
> 
> That's all taken care of then :) 
> 
> Also on the target? 
> 
> > > 
> > > > > 
> > > > > Can paste your iSCSI session settings negotiated with the target? 
> > > > Pardon my ignorance :( but, other than packet traces, how do I show the
> > > > final negotiated settings?
> > > 
> > > Try:
> > > 
> > > iscsiadm -i -m session
> > > iscsiadm -m session -P3
> > > 
> > Here's what it says.  Pretty much as expected.  We are using COMSTAR on
> > the target and took some traces to see what COMSTAR was expecting. We
> > set the open-iscsi parameters to match:
> > 
> > Current Portal: 172.x.x.174:3260,2
> >         Persistent Portal: 172.x.x.174:3260,2
> >                 **********
> >                 Interface:
> >                 **********
> >                 Iface Name: default
> >                 Iface Transport: tcp
> >                 Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
> >                 Iface IPaddress: 172.x.x.162
> >                 Iface HWaddress: default
> >                 Iface Netdev: default
> >                 SID: 32
> >                 iSCSI Connection State: LOGGED IN
> >                 iSCSI Session State: LOGGED_IN
> >                 Internal iscsid Session State: NO CHANGE
> >                 ************************
> >                 Negotiated iSCSI params:
> >                 ************************
> >                 HeaderDigest: None
> >                 DataDigest: None
> >                 MaxRecvDataSegmentLength: 131072
> >                 MaxXmitDataSegmentLength: 8192
> >                 FirstBurstLength: 65536
> >                 MaxBurstLength: 524288
> >                 ImmediateData: Yes
> >                 InitialR2T: Yes
> 
> I guess InitialR2T could be No for a bit better performance? 
> 
> MaxXmitDataSegmentLength looks small? 
> 
> > > > > You should be able to get many times the throughput you get now.. just with
> > > > > a single path/session.
> > > > > 
> > > > > What kind of latency do you have from the initiator to the target/storage? 
> > > > > 
> > > > > Try with for example 4 kB ping:
> > > > > ping -s 4096 <ip_of_the_iscsi_target>
> > > > We have about 400 micro seconds - that seems a bit high :(
> > > > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > > > 
> > > 
> > > Yeah.. that's a bit high. 
> > Actually, with more testing, we're seeing it stretch up to over 700
> > micro-seconds.  I'll attach a raft of data I collected at the end of
> > this email.
> 
> Ok.
> 
> > > I think Ross suggested in some other thread the following settings for e1000
> > > NICs:
> > > 
> > > "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> > > and RxRingBufferSize=4096 (verify those option names with a modinfo)
> > > and add those to modprobe.conf."
> > We did try playing with the ring buffer but to no avail.  Modinfo does
> > not seem to display the current settings.  We did try playing with
> > setting the InterruptThrottleRate to 1 but again to no avail.  As I'll
> > mention later, I suspect the issue might be the opensolaris based
> > target.
> 
> Could be..
> 
> > > 
> > > > I would love to use larger block sizes as you suggest in your other
> > > > email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
> > > > way to change it and would gladly do so if someone knows how.
> > > > 
> > > 
> > > Are we talking about filesystem block sizes? That shouldn't be a problem if
> > > your application uses larger blocksizes for read/write operations.. 
> > > 
> > Yes, file system block size.  When we try rough, end user style tests,
> > e.g., large file copies, we seem to get the performance indicated by 4KB
> > blocks, i.e., lousy!
> 
> Yep.. try upgrading to 10 Gbit Ethernet for much lower latency ;)
> 
> > > Try for example with:
> > > dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
> > Large block sizes can make the system truly fly so we suspect you are
> > absolutely correct about latency being the issue.  We did do our testing
> > with raw interfaces by the way.
> 
> Ok.
> 
> > <snip>
> > I did a little digging and calculating and here is what I came up with
> > and sent to Nexenta.  Please tell me if I am on the right track.
> > 
> > I am using jumbo frames and should be able to get 2 4KB blocks
> > per frame.  Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC
> > -oops we need to add iSCSI -what size is the iSCSI header?) + 12
> > (interframe gap) = 8282 bytes.  Transmission latency should be 8282 *
> > 8 / 1,000,000,000 = 66.3 micro-seconds.  Switch latency is 5.7
> > microseconds so let's say network latency is 72 - well let's say 75
> > micro-seconds.  The only additional latency should be added by the
> > network stacks on the target and initiator.
> > 
> > Current round trip latency between the initiator (Linux) and target
> > (Nexenta) is around 400 micro-seconds and fluctuates significantly:
> > 
> > Hmm . .  this is worse than the last test:
> > PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.
> 
> > --- 172.30.13.158 ping statistics ---
> > 33 packets transmitted, 33 received, 0% packet loss, time 32000ms
> > rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms
> > 
> > There is nothing going on in the network.  So we are seeing 574
> > micro-seconds total with only 150 micro-seconds attributed to
> > transmission.  And we see a wide variation in latency.
> >
> 
> Yeah something wrong there.. How much latency do you have between different
> initiator machines? 
>  
> > I then tested the latency between interfaces on the initiator and the
> > target.  Here is what I get for internal latency on the Linux initiator:
> > PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes
> > of data.
> > --- 172.30.13.18 ping statistics ---
> > 29 packets transmitted, 29 received, 0% packet loss, time 27999ms
> > rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms
> > 
> > A very consistent 18 micro-seconds.
> > 
> 
> Yeah, I take it that's not through network/switch :) 
> 
> > Here is what I get from the Z200:
> > root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
> > PING 172.30.13.190: 4096 data bytes
> > ----172.30.13.190 PING Statistics----
> > 31 packets transmitted, 31 packets received, 0% packet loss
> > round-trip (ms)  min/avg/max/stddev = 0.042/0.066/0.104/0.019
> > 
> 
> Big difference.. I'm not familiar with Solaris, so can't really suggest what
> to tune there.. 
> 
> > Notice it is several times longer latency with much wider variation.
> > How to we tune the opensolaris network stack to reduce it's latency? I'd
> > really like to improve the individual user experience.  I can tell them
> > it's like commuting to work on the train instead of the car during rush
> > hour - faster when there's lots of traffic but slower when there is not,
> > but they will judge the product by their individual experiences more
> > than their collective experiences.  Thus, I really want to improve the
> > individual disk operation throughput.
> > 
> > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > latency from initiator and target each, that would be roughly 200 micro
> > seconds.  That would almost triple the throughput from what we are
> > currently seeing.
> > 
> 
> Indeed :) 
> 
> > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > I can certainly learn but am I headed in the right direction or is this
> > direction of investigation misguided? Thanks - John
> > 
> 
> Low latency is the key for good (iSCSI) SAN performance, as it directly
> gives you more (possible) IOPS. 
> 
> Other option is to configure software/settings so that there are multiple
> outstanding IO's on the fly.. then you're not limited with the latency (so much).
> 
> -- Pasi
<snip>
Ah, there is one more question. If latency is such an issue, as it has
proved to be, would it improve performance to put the file system
journal on local disk rather than the iSCSI disks? - John
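
(To be concrete, what I have in mind is ext3's external journal support,
roughly along these lines, with the device names purely illustrative and
the journal device needing the same block size as the filesystem:

  mke2fs -O journal_dev -b 4096 /dev/sda5
  mkfs.ext3 -b 4096 -J device=/dev/sda5 /dev/mapper/isda

I have not actually tried it yet, hence the question.)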
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-25  3:41                             ` John A. Sullivan III
@ 2009-03-25 15:52                               ` Pasi Kärkkäinen
  2009-03-25 16:21                                 ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-25 15:52 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, Mar 24, 2009 at 11:41:00PM -0400, John A. Sullivan III wrote:
> > > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > > latency from initiator and target each, that would be roughly 200 micro
> > > seconds.  That would almost triple the throughput from what we are
> > > currently seeing.
> > > 
> > 
> > Indeed :) 
> > 
> > > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > > I can certainly learn but am I headed in the right direction or is this
> > > direction of investigation misguided? Thanks - John
> > > 
> > 
> > Low latency is the key for good (iSCSI) SAN performance, as it directly
> > gives you more (possible) IOPS. 
> > 
> > Other option is to configure software/settings so that there are multiple
> > outstanding IO's on the fly.. then you're not limited with the latency (so much).
> > 
> > -- Pasi
> <snip>
> Ross has been of enormous help offline.  Indeed, disabling jumbo packets
> produced an almost 50% increase in single threaded throughput.  We are
> pretty well set although still a bit disappointed in the latency we are
> seeing in opensolaris and have escalated to the vendor about addressing
> it.
> 

Ok. That's a pretty big increase. Did you figure out why that happens? 

> The once piece which is still a mystery is why using four targets on
> four separate interfaces striped with dmadm RAID0 does not produce an
> aggregate of slightly less than four times the IOPS of a single target
> on a single interface. This would not seem to be the out of order SCSI
> command problem of multipath.  One of life's great mysteries yet to be
> revealed.  Thanks again, all - John

Hmm.. maybe the out-of-order problem happens at the target? It gets IO
requests to nearby offsets from 4 different sessions and there's some kind
of locking or so going on? 

Just guessing. 

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-25  3:44                             ` John A. Sullivan III
@ 2009-03-25 15:52                               ` Pasi Kärkkäinen
  2009-03-25 16:19                                 ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-25 15:52 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Tue, Mar 24, 2009 at 11:44:52PM -0400, John A. Sullivan III wrote:
> <snip>
> Ah, there is one more question. If latency is such an issue, as it has
> proved to be, would it improve performance to put the file system
> journal on local disk rather than the iSCSI disks? - John

I have never tried this.. so can't help with that unfortunately. 

Try it? :)

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-25 15:52                               ` Pasi Kärkkäinen
@ 2009-03-25 16:19                                 ` John A. Sullivan III
  0 siblings, 0 replies; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-25 16:19 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Wed, 2009-03-25 at 17:52 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 11:44:52PM -0400, John A. Sullivan III wrote:
> > <snip>
> > Ah, there is one more question. If latency is such an issue, as it has
> > proved to be, would it improve performance to put the file system
> > journal on local disk rather than the iSCSI disks? - John
> 
> I have never tried this.. so can't help with that unfortunately. 
> 
> Try it? :)
> 
> -- Pasi
<snip>
Ross was, once again, most helpful here and mentioned he has tried it
and it is a bad idea.  It can apparently cause problems if there is a
network disconnect - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-25 15:52                               ` Pasi Kärkkäinen
@ 2009-03-25 16:21                                 ` John A. Sullivan III
  2009-03-27  7:03                                   ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-25 16:21 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Wed, 2009-03-25 at 17:52 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 11:41:00PM -0400, John A. Sullivan III wrote:
> > > > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > > > latency from initiator and target each, that would be roughly 200 micro
> > > > seconds.  That would almost triple the throughput from what we are
> > > > currently seeing.
> > > > 
> > > 
> > > Indeed :) 
> > > 
> > > > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > > > I can certainly learn but am I headed in the right direction or is this
> > > > direction of investigation misguided? Thanks - John
> > > > 
> > > 
> > > Low latency is the key for good (iSCSI) SAN performance, as it directly
> > > gives you more (possible) IOPS. 
> > > 
> > > Other option is to configure software/settings so that there are multiple
> > > outstanding IO's on the fly.. then you're not limited with the latency (so much).
> > > 
> > > -- Pasi
> > <snip>
> > Ross has been of enormous help offline.  Indeed, disabling jumbo packets
> > produced an almost 50% increase in single threaded throughput.  We are
> > pretty well set although still a bit disappointed in the latency we are
> > seeing in opensolaris and have escalated to the vendor about addressing
> > it.
> > 
> 
> Ok. That's pretty big increase. Did you figure out why that happens? 
Greater latency with jumbo packets.
> 
> > The once piece which is still a mystery is why using four targets on
> > four separate interfaces striped with dmadm RAID0 does not produce an
> > aggregate of slightly less than four times the IOPS of a single target
> > on a single interface. This would not seem to be the out of order SCSI
> > command problem of multipath.  One of life's great mysteries yet to be
> > revealed.  Thanks again, all - John
> 
> Hmm.. maybe the out-of-order problem happens at the target? It gets IO
> requests to nearby offsets from 4 different sessions and there's some kind
> of locking or so going on? 
Ross pointed out a flaw in my test methodology.  By running one I/O at a
time, it was literally doing just that - not one full RAID0 I/O but
apparently one disk I/O.  He said that, to truly test it, I would need to
run as many concurrent I/Os as there are disks in the array.  Thanks - John
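
So, roughly, something like the following against a four-disk stripe
(purely illustrative - the device name is made up and the flags other
than -B and -K are from memory, so check disktest's usage output):

  disktest -B 4k -r -p l -K 4 -T 60 /dev/mapper/istripe

i.e., one outstanding I/O per member disk rather than one overall.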
> 
> Just guessing. 
> 
> -- Pasi
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-25 16:21                                 ` John A. Sullivan III
@ 2009-03-27  7:03                                   ` John A. Sullivan III
  2009-03-27  9:02                                     ` Pasi Kärkkäinen
                                                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-27  7:03 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Wed, 2009-03-25 at 12:21 -0400, John A. Sullivan III wrote:
> On Wed, 2009-03-25 at 17:52 +0200, Pasi Kärkkäinen wrote:
> > On Tue, Mar 24, 2009 at 11:41:00PM -0400, John A. Sullivan III wrote:
> > > > > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > > > > latency from initiator and target each, that would be roughly 200 micro
> > > > > seconds.  That would almost triple the throughput from what we are
> > > > > currently seeing.
> > > > > 
> > > > 
> > > > Indeed :) 
> > > > 
> > > > > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > > > > I can certainly learn but am I headed in the right direction or is this
> > > > > direction of investigation misguided? Thanks - John
> > > > > 
> > > > 
> > > > Low latency is the key for good (iSCSI) SAN performance, as it directly
> > > > gives you more (possible) IOPS. 
> > > > 
> > > > Other option is to configure software/settings so that there are multiple
> > > > outstanding IO's on the fly.. then you're not limited with the latency (so much).
> > > > 
> > > > -- Pasi
> > > <snip>
> > > Ross has been of enormous help offline.  Indeed, disabling jumbo packets
> > > produced an almost 50% increase in single threaded throughput.  We are
> > > pretty well set although still a bit disappointed in the latency we are
> > > seeing in opensolaris and have escalated to the vendor about addressing
> > > it.
> > > 
> > 
> > Ok. That's pretty big increase. Did you figure out why that happens? 
> Greater latency with jumbo packets.
> > 
> > > The once piece which is still a mystery is why using four targets on
> > > four separate interfaces striped with dmadm RAID0 does not produce an
> > > aggregate of slightly less than four times the IOPS of a single target
> > > on a single interface. This would not seem to be the out of order SCSI
> > > command problem of multipath.  One of life's great mysteries yet to be
> > > revealed.  Thanks again, all - John
> > 
> > Hmm.. maybe the out-of-order problem happens at the target? It gets IO
> > requests to nearby offsets from 4 different sessions and there's some kind
> > of locking or so going on? 
> Ross pointed out a flaw in my test methodology.  By running one I/O at a
> time, it was literally doing that - not one full RAID0 I/O but one disk
> I/O apparently.  He said to truly test it, I would need to run as many
> concurrent I/Os as there were disks in the array.  Thanks - John
> ><snip>
Argh!!! This turned out to be alarmingly untrue.  This time, we were
doing some light testing on a different server with two bonded
interfaces in a single bridge (KVM environment) going to the same SAN we
used in our four-port test.

For kicks and to prove to ourselves that RAID0 scaled with multiple I/O
as opposed to limiting the test to only single I/O, we tried some actual
file transfers to the SAN mounted in sync mode.  We found concurrently
transferring two identical files to the RAID0 array composed of two
iSCSI attached drives was 57% slower than concurrently transferring the
files to the drives separately. In other words, copying file1 and file2
concurrently to RAID0 took 57% longer than concurrently copying file1 to
disk1 and file2 to disk2.

We then took a slightly different approach and used disktest.  We ran two
concurrent sessions with -K1.  In one case, we ran both sessions against
the two-disk RAID0 array.  The performance was again significantly lower
than running the two concurrent tests against two separate iSCSI disks.
Just to be clear, these were the same disks that composed the array, just
not grouped in the array.

Even more alarmingly, we did the same test using multipath multibus,
i.e., two concurrent disktest runs with -K1 (both reads and writes, all
sequential with 4K block sizes).  The first session completely starved
the second.  The first one continued at only slightly reduced speed
while the second one (kicked off just as fast as we could hit the enter
key) received only roughly 50 IOPS.  Yes, that's fifty.
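
For context, by multipath multibus I mean the usual multipath.conf
arrangement along these lines (a sketch of the intent rather than a
verified config; the wwid is a placeholder):

  multipaths {
      multipath {
          wwid                  <wwid-of-the-lun>
          alias                 isda
          path_grouping_policy  multibus
          rr_min_io             100
      }
  }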

Frightening but I thought I had better pass along such extreme results
to the multipath team.  Thanks - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-27  7:03                                   ` John A. Sullivan III
@ 2009-03-27  9:02                                     ` Pasi Kärkkäinen
  2009-03-27  9:14                                       ` John A. Sullivan III
       [not found]                                     ` <2A4646CA-45AF-460A-8E43-A6D9C070842A@medallion.com>
  2009-03-27 18:28                                     ` John A. Sullivan III
  2 siblings, 1 reply; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-27  9:02 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Fri, Mar 27, 2009 at 03:03:35AM -0400, John A. Sullivan III wrote:
> > > 
> > > > The once piece which is still a mystery is why using four targets on
> > > > four separate interfaces striped with dmadm RAID0 does not produce an
> > > > aggregate of slightly less than four times the IOPS of a single target
> > > > on a single interface. This would not seem to be the out of order SCSI
> > > > command problem of multipath.  One of life's great mysteries yet to be
> > > > revealed.  Thanks again, all - John
> > > 
> > > Hmm.. maybe the out-of-order problem happens at the target? It gets IO
> > > requests to nearby offsets from 4 different sessions and there's some kind
> > > of locking or so going on? 
> > Ross pointed out a flaw in my test methodology.  By running one I/O at a
> > time, it was literally doing that - not one full RAID0 I/O but one disk
> > I/O apparently.  He said to truly test it, I would need to run as many
> > concurrent I/Os as there were disks in the array.  Thanks - John
> > ><snip>
> Argh!!! This turned out to be alarmingly untrue.  This time, we were
> doing some light testing on a different server with two bonded
> interfaces in a single bridge (KVM environment) going to the same SAM we
> used in our four port test.
> 

Is the SAN also using bonded interfaces?

> For kicks and to prove to ourselves that RAID0 scaled with multiple I/O
> as opposed to limiting the test to only single I/O, we tried some actual
> file transfers to the SAN mounted in sync mode.  We found concurrently
> transferring two identical files to the RAID0 array composed of two
> iSCSI attached drives was 57% slower than concurrently transferring the
> files to the drives separately. In other words, copying file1 and file2
> concurrently to RAID0 took 57% longer than concurrently copying file1 to
> disk1 and file2 to disk2.
> 

Hmm.. I wonder why that happens..

> We then took a little different approach and used disktest.  We ran two
> concurrent sessions with -K1.  In one case, we ran both sessions to the
> 2 disk RAID0 array.  The performance was significantly less again, than
> running the two concurrent tests against two separate iSCSI disks.  Just
> to be clear, these were the same disks as composed the array, just not
> grouped in the array.
> 

There has to be some logical explanation to this.. 

> Even more alarmingly, we did the same test using multipath multibus,
> i.e., two concurrent disktest with -K1 (both reads and rights, all
> sequential with 4K block sizes).  The first session completely starved
> the second.  The first one continued at only slightly reduced speed
> while the second one (kicked off just as fast as we could hit the enter
> key) received only roughly 50 IOPS.  Yes, that's fifty.
> 
> Frightening but I thought I had better pass along such extreme results
> to the multipath team.  Thanks - John

Hmm, so you had mpath0 and mpath1, and you ran disktest against both, at the
same time? 

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-27  9:02                                     ` Pasi Kärkkäinen
@ 2009-03-27  9:14                                       ` John A. Sullivan III
  0 siblings, 0 replies; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-27  9:14 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Fri, 2009-03-27 at 11:02 +0200, Pasi Kärkkäinen wrote:
> On Fri, Mar 27, 2009 at 03:03:35AM -0400, John A. Sullivan III wrote:
> > > > 
> > > > > The once piece which is still a mystery is why using four targets on
> > > > > four separate interfaces striped with dmadm RAID0 does not produce an
> > > > > aggregate of slightly less than four times the IOPS of a single target
> > > > > on a single interface. This would not seem to be the out of order SCSI
> > > > > command problem of multipath.  One of life's great mysteries yet to be
> > > > > revealed.  Thanks again, all - John
> > > > 
> > > > Hmm.. maybe the out-of-order problem happens at the target? It gets IO
> > > > requests to nearby offsets from 4 different sessions and there's some kind
> > > > of locking or so going on? 
> > > Ross pointed out a flaw in my test methodology.  By running one I/O at a
> > > time, it was literally doing that - not one full RAID0 I/O but one disk
> > > I/O apparently.  He said to truly test it, I would need to run as many
> > > concurrent I/Os as there were disks in the array.  Thanks - John
> > > ><snip>
> > Argh!!! This turned out to be alarmingly untrue.  This time, we were
> > doing some light testing on a different server with two bonded
> > interfaces in a single bridge (KVM environment) going to the same SAM we
> > used in our four port test.
> > 
> 
> Is the SAN also using bonded interfaces?
No.
> 
> > For kicks and to prove to ourselves that RAID0 scaled with multiple I/O
> > as opposed to limiting the test to only single I/O, we tried some actual
> > file transfers to the SAN mounted in sync mode.  We found concurrently
> > transferring two identical files to the RAID0 array composed of two
> > iSCSI attached drives was 57% slower than concurrently transferring the
> > files to the drives separately. In other words, copying file1 and file2
> > concurrently to RAID0 took 57% longer than concurrently copying file1 to
> > disk1 and file2 to disk2.
> > 
> 
> Hmm.. I wonder why that happens..
> 
> > We then took a little different approach and used disktest.  We ran two
> > concurrent sessions with -K1.  In one case, we ran both sessions to the
> > 2 disk RAID0 array.  The performance was significantly less again, than
> > running the two concurrent tests against two separate iSCSI disks.  Just
> > to be clear, these were the same disks as composed the array, just not
> > grouped in the array.
> > 
> 
> There has to be some logical explanation to this.. 
> 
> > Even more alarmingly, we did the same test using multipath multibus,
> > i.e., two concurrent disktest with -K1 (both reads and rights, all
> > sequential with 4K block sizes).  The first session completely starved
> > the second.  The first one continued at only slightly reduced speed
> > while the second one (kicked off just as fast as we could hit the enter
> > key) received only roughly 50 IOPS.  Yes, that's fifty.
> > 
> > Frightening but I thought I had better pass along such extreme results
> > to the multipath team.  Thanks - John
> 
> Hmm, so you had mpath0 and mpath1, and you ran disktest against both, at the
> same time? 
I had /dev/mapper/isda (composed of /dev/sdc and /dev/sdd) and ran two
separate but concurrent disktests against /dev/mapper/isda
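(For anyone reproducing this, the layout is easy to confirm with
something like:

  multipath -ll isda
  dmsetup table isda

which should show a single path group containing both sdc and sdd when
the policy is multibus.)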
<snip>
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Shell Scripts or Arbitrary Priority Callouts?
       [not found]                                         ` <E2BB8074E5500C42984D980D4BD78EF9029E3CFA@MFG-NYC-EXCH2.mfg.prv>
@ 2009-03-27 17:15                                           ` John A. Sullivan III
  2009-03-27 18:06                                             ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-27 17:15 UTC (permalink / raw)
  To: Ross S. W. Walker; +Cc: device-mapper development

On Fri, 2009-03-27 at 10:59 -0400, Ross S. W. Walker wrote:
> John A. Sullivan III wrote:
> > 
> > On Fri, 2009-03-27 at 10:09 -0400, Ross S. W. Walker wrote:
> > > 
> > > I'm sorry you lost me here on this test. NFS file transfers are
> > > completely different than iSCSI as NFS uses the OS' (for better or
> > > worse) implementation of file io and not block io, so using NFS as a
> > > test isn't the best.
> > 
> > Sorry if I wasn't clear.  These were iSCSI tests but using file
> > services, e.g., cp, mv.
> > 
> > > My whole idea of mentioning the multiple ios was basically if one
> > > disk's maximum transfer speed is 40MB/s and you put 4 in a RAID0 then
> > > if you tested that RAID0 with an app that did 1 io at a time you would
> > > still only see 40MB/s and not the 160MB/s it is capable of. Obviously
> > > that isn't what's happening here.
> > 
> > Yes, that's what I was hoping for :)
> > 
> > > One would hope throughput scales up with number of simultaneous ios or
> > > there is something broken. Maybe it is dm-raid that is broken, or
> > > maybe it's just dm-raid+iSCSI that's broken, or maybe it's just
> > > dm-raid+iSCSI+ZFS that's broken? Who knows? It's obvious that this
> > > setup won't work though.
> > > 
> <snip>
> > > 
> > > I would take a step back at this point and re-evaluate if running
> > > software RAID over iSCSI is the best. It seems like a highly complex
> > > system to get the performance you need.
> > 
> > Yes, that's what we've decided to do which has forced a bit of redesign
> > on our part.
> > 
> > > The best, most reliably performing results I have seen with iSCSI have
> > > been those where the target is performing the RAID (software or
> > > hardware) and the initiator treats it as a plain disk, using interface
> > > aggregates or multi-paths to gain throughput.
> > 
> > Strangely, multipath didn't seem to help as I thought it would.  Thanks
> > for all your help - John
> 
> Not a problem, in fact I have a vested interest in you succeeding because
> we are looking at moving our iSCSI over from Linux to Solaris to get
> that warm fuzzy ZFS integrity assurance feeling.
> 
> It is just (and we are running Solaris 10u6 and not OpenSolaris here), we
> find the user space iscsitgt daemon leaves something to be desired over
> the kernel based iSCSI Enterprise Target on Linux. Also we have noticed
> strange performance abnormalities in our tests of iSCSI to a ZFS ZVOL
> as opposed to iSCSI to a raw disk.
> 
> Maybe it's just the price you pay doing iSCSI to a COW file system instead
> of a raw disk? I dunno, I always prefer to get it all then compromise,
> but sometimes one doesn't have a choice.
> 
> Oh BTW ZFS as an NFS server works brilliantly. We have moved 50% of our
> VMs over from iSCSI to NFS and performance is great as each VM has 
> a level of guaranteed throughput so no one VM can starve out the others.
> 
> We are going to stick with our Linux iSCSI target for a wee bit longer
> though to see if the ZFS/ZVOL/iscsitgt situation improves in 10u7. Gives
> me time to learn the Solaris LiveUpgrade feature.
<snip>
Hi Ross. I hope you don't mind that I copied the mailing list in case it
is of interest to anyone.

We are using COMSTAR, which has given a substantial improvement over
iscsitgt.  I was concerned about the overhead of writing to the file
system, although I'm not sure that's really the issue.  When we run
bonnie++ tests on the unit itself, we see it able to read at over 500
MB/s and write at almost 250 MB/s.  That's in a unit with four mirror
pairs.
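
(Those numbers are from plain bonnie++ runs on the target itself,
roughly of this shape, with the path illustrative and the size kept
well above the box's RAM:

  bonnie++ -u root -d /tank/bench -s 16g -n 0 -f

so nothing exotic on the benchmark side.)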

The killer issue seems to be latency.  As seen in a previous post, the
latency on the opensolaris unit is dramatically higher than on the Linux
unit and more erratic.

That brings me back to a question about NFS.  I could use NFS for our
scenario as much of our disk I/O is file system I/O (hence the 4KB block
size problem and impact of latency).  We've thought about switching to
NFS and I've just not benchmarked it because we are so behind on our
project (not a good reason).  However, everything I read says NFS is
slower than iSCSI, especially for the random file I/O we expect, although
I wonder if it would help us on sequential I/O since we can change the
I/O size to something much larger than the file system block size.

So, what are the downsides of using NFS? Will I see the same problem
because of latency (since we get plenty of raw throughput on disk) or is
it a solution to our latency problem because we can tell it to use large
transfer sizes? Thanks - John
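
To make that last point concrete, what I have in mind is simply mounting
with large transfer sizes, something like the following with a made-up
export path:

  mount -t nfs -o rsize=32768,wsize=32768,hard,intr,proto=tcp \
      san01:/export/vd /mnt/vd

so each wire operation would move 32 KB rather than a 4 KB filesystem
block.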
> 
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-27 17:15                                           ` John A. Sullivan III
@ 2009-03-27 18:06                                             ` John A. Sullivan III
  2009-03-27 18:23                                               ` Ross S. W. Walker
  0 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-27 18:06 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Fri, 2009-03-27 at 13:15 -0400, John A. Sullivan III wrote:
> On Fri, 2009-03-27 at 10:59 -0400, Ross S. W. Walker wrote:
> > John A. Sullivan III wrote:
> > > 
> > > On Fri, 2009-03-27 at 10:09 -0400, Ross S. W. Walker wrote:
> > > > 
> > > > I'm sorry you lost me here on this test. NFS file transfers are
> > > > completely different then iSCSI as NFS uses the OS' (for better or
> > > > worse) implementation of file io and not block io, so using NFS as a
> > > > test isn't the best.
> > > 
> > > Sorry if I wasn't clear.  These were iSCSI tests but using file
> > > services, e.g., cp, mv.
> > > 
> > > > My whole idea of mentioning the multiple ios was basically if one
> > > > disk's maximum transfer speed is 40MB/s and you put 4 in a RAID0 then
> > > > if you tested that RAID0 with an app that did 1 io at a time you would
> > > > still only see 40MB/s and not the 160MB/s it is capable of. Obviously
> > > > that isn't what's happening here.
> > > 
> > > Yes, that's what I was hoping for :)
> > > 
> > > > One would hope throughput scales up with number of simultaneous ios or
> > > > there is something broken. Maybe it is dm-raid that is broken, or
> > > > maybe it's just dm-raid+iSCSI that's broken, or maybe it's just
> > > > dm-raid+iSCSI+ZFS that's broken? Who knows? Iit's obvious that this
> > > > setup won't work though.
> > > > 
> > <snip>
> > > > 
> > > > I would take a step back at this point and re-evaluate if running
> > > > software RAID over iSCSI is the best. It seems like a highly complex
> > > > system to get the performance you need.
> > > 
> > > Yes, that's what we've decided to do which has forced a bit of redesign
> > > on our part.
> > > 
> > > > The best, most reliably performing results I have seen with iSCSI have
> > > > been those where the target is performing the RAID (software or
> > > > hardware) and the initiator treats it as a plain disk, using interface
> > > > aggregates or multi-paths to gain throughput.
> > > 
> > > Strangely, multipath didn't seem to help as I thought it would.  Thanks
> > > for all your help - John
> > 
> > Not a problem, in fact I have a vested interest in you succeeding because
> > we are looking at moving our iSCSI over from Linux to Solaris to get
> > that warm fuzzy ZFS integrity assurance feeling.
> > 
> > It is just (and we are running Solaris 10u6 and not OpenSolaris here), we
> > find the user space iscsitgt daemon leaves something to be desired over
> > the kernel based iSCSI Enterprise Target on Linux. Also we have noticed
> > strange performance abnormalities in our tests of iSCSI to a ZFS ZVOL
> > as opposed to iSCSI to a raw disk.
> > 
> > Maybe it's just the price you pay doing iSCSI to a COW file system instead
> > of a raw disk? I dunno, I always prefer to get it all then compromise,
> > but sometimes one doesn't have a choice.
> > 
> > Oh BTW ZFS as an NFS server works brilliantly. We have moved 50% of our
> > VMs over from iSCSI to NFS and performance is great as each VM has 
> > a level of guaranteed throughput so no one VM can starve out the others.
> > 
> > We are going to stick with our Linux iSCSI target for a wee bit longer
> > though to see if the ZFS/ZVOL/iscsitgt situation improves in 10u7. Gives
> > me time to learn the Solaris LiveUpgrade feature.
> <snip>
> Hi Ross. I hope you don't mind that I copied the mailing list in case it
> is of interest to anyone.
> 
> We are using COMSTAR which has given a substantial improvement over
> iscsitgt.  I was concerned about the overhead of writing to the file
> system although I'm not sure that's really the issue.  When we run
> bonnie++ tests on the unit itself, we see it able to read at over 500
> MB/s and write at almost 250.  That's in a unit with four mirror
> pairs.  
> 
> The killer issue seems to be latency.  As seen in a previous post, the
> latency on the opensolaris unit is dramatically higher than on the Linux
> unit and more erratic.
> 
> That brings me back to a question about NFS.  I could use NFS for our
> scenario as much of our disk I/O is file system I/O (hence the 4KB block
> size problem and impact of latency).  We've thought about switching to
> NFS and I've just not benchmarked it because we are so behind on our
> project (not a good reason).  However, everything I read says NFS is
> slower iSCSI especially for random file I/O as we expect although I
> wonder if it would help us on sequential I/O since we can change the I/O
> size to something much larger than the file system block size.
> 
> So, what are the downsides of using NFS? Will I see the same problem
> because of latency (since we get plenty of raw throughput on disk) or is
> it a solution to our latency problem because we can tell it to use large
> transfer sizes? Thanks - John
> > 
Ah, I forgot another issue in our environment which mitigates against NFS.
It is very heavily virtualized, so there are not a lot of physical
devices.  They do have multiple Ethernet interfaces (e.g., there are
ten GbE interfaces on the SAN).  However, we have a problem aggregating
throughput.  Because there are only a handful of actual (but very large)
systems, Ethernet bonding doesn't help us very much.  Anything that
passes through the ProCurve switches is balanced by MAC address (as
opposed to being able to hash on the socket).  Thus, all traffic flows
through a single interface.

Moreover, opensolaris only seems to support 802.3ad bonding.  Our
ProCurve switches do not support 802.3ad bonding across multiple
switches, so using it leaves us vulnerable to a single point of failure
(ironically).  I believe Nortel supports this, though I'm not sure about
Cisco.  HP is releasing that capability this quarter or next, but not in
our lowly 2810 models.  We thought we got around the problem by using
Linux-based balance-xor with a layer 3+4 hash, and it seemed to work
until we noticed that the switch CPU shot to 99% under even light
traffic loads :(
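
For the record, the bonding setup in question was the stock Linux one,
along these lines in /etc/modprobe.conf (names illustrative):

  alias bond0 bonding
  options bond0 mode=balance-xor xmit_hash_policy=layer3+4 miimon=100

so nothing exotic on our side; it was the switch that objected.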

That's initially why we wanted to go the RAID0 route with each interface
on a different network for aggregating throughput and dm-multipath
underneath it for fault tolerance.  If we go with NFS, I believe we will
lose the ability to aggregate bandwidth.  Thanks - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-27 18:06                                             ` John A. Sullivan III
@ 2009-03-27 18:23                                               ` Ross S. W. Walker
  0 siblings, 0 replies; 34+ messages in thread
From: Ross S. W. Walker @ 2009-03-27 18:23 UTC (permalink / raw)
  To: John A. Sullivan III, device-mapper development

John A. Sullivan III wrote:
> 
> Ah, I forgot another issue in our environment which mitigates 
> again NFS.
> It is very heavily virtualized so there are not a lot of physical
> devices.  They do have multiple Ethernet interfaces, (e.g., there are
> ten GbE interfaces on the SAN).  However, we have a problem aggregating
> throughput.  Because there are only a handful of actual (but very large)
> systems, Ethernet bonding doesn't help us very much.  Anything that
> passes through the ProCurve switches is mapped by MAC address (as
> opposed to being able to hash upon socket).  Thus, all traffic flows
> through a single interface.

With virtualization it's not about the bandwidth, it's about the random
IOPS. All that random I/O really does mean that most throughput is
measured in 100s of KB, not MB. Most shared storage will assemble
sequential reads in its read cache.

Think of my suggestion of NFS for VM OS configuration and system
images, then inside those VMs use iSCSI for application specific
storage. If running iSCSI inside the VM is too problematic due to
virtual networking issues, you can always have the host system
run the initiator and present the SCSI devices to the VMs. It's
just a PITA when you re-allocate VMs from one host to another.
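
For what it's worth, the host-side presentation is simple enough; a
hypothetical example with made-up names, assuming libvirt:

  virsh attach-disk myvm /dev/mapper/isda vdb

The pain is purely the re-plumbing when the VM moves to another host.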

> Moreover, opensolaris only seems to support 802.3ad bonding.  Our
> ProCurve switches do not support 802.3ad bonding across multiple
> switches so using leaves use vulnerable to a single point of failure
> (ironically).  I believe Nortel supports this though I'm not sure about
> Cisco.  HP is releasing that capability this or next quarter but not in
> our lowly 2810 models.  We thought we got around the problems by using
> Linux based balance-xor using hash 3+4 and it seemed to work until we
> noticed than the switch CPU shot to 99% under even light traffic
> loads :(

Yes, the ProCurves and PowerConnects only use IEEE protocols and
IEEE doesn't have a protocol for splitting a LAG yet :-(

And yes the XOR load-balance will drive your switches to their knees
with gratuitous ARPs. I too found that out the hard way... :-(

> That's initially why we wanted to go the RAID0 route with each interface
> on a different network for aggregating throughput and dm-multipath
> underneath it for fault tolerance.  If we go with NFS, I believe we will
> lose the ability to aggregate bandwidth.  Thanks - John

With NFS you can use 802.3ad, which is, as the scary man in No Country for
Old Men said, "the best offer you're going to get". With Solaris' IPMP you
can protect the NFS SPoF, and if you couple that with application-specific
iSCSI running over different LAG groups using multi-pathing, then you can
probably make use of all that bandwidth.

-Ross


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-27  7:03                                   ` John A. Sullivan III
  2009-03-27  9:02                                     ` Pasi Kärkkäinen
       [not found]                                     ` <2A4646CA-45AF-460A-8E43-A6D9C070842A@medallion.com>
@ 2009-03-27 18:28                                     ` John A. Sullivan III
  2009-03-29 18:09                                       ` Pasi Kärkkäinen
  2 siblings, 1 reply; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-27 18:28 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Fri, 2009-03-27 at 03:03 -0400, John A. Sullivan III wrote:
> On Wed, 2009-03-25 at 12:21 -0400, John A. Sullivan III wrote:
> > On Wed, 2009-03-25 at 17:52 +0200, Pasi Kärkkäinen wrote:
> > > On Tue, Mar 24, 2009 at 11:41:00PM -0400, John A. Sullivan III wrote:
> > > > > > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > > > > > latency from initiator and target each, that would be roughly 200 micro
> > > > > > seconds.  That would almost triple the throughput from what we are
> > > > > > currently seeing.
> > > > > > 
> > > > > 
> > > > > Indeed :) 
> > > > > 
> > > > > > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > > > > > I can certainly learn but am I headed in the right direction or is this
> > > > > > direction of investigation misguided? Thanks - John
> > > > > > 
> > > > > 
> > > > > Low latency is the key for good (iSCSI) SAN performance, as it directly
> > > > > gives you more (possible) IOPS. 
> > > > > 
> > > > > Other option is to configure software/settings so that there are multiple
> > > > > outstanding IO's on the fly.. then you're not limited with the latency (so much).
> > > > > 
> > > > > -- Pasi
> > > > <snip>
> > > > Ross has been of enormous help offline.  Indeed, disabling jumbo packets
> > > > produced an almost 50% increase in single threaded throughput.  We are
> > > > pretty well set although still a bit disappointed in the latency we are
> > > > seeing in opensolaris and have escalated to the vendor about addressing
> > > > it.
> > > > 
> > > 
> > > Ok. That's pretty big increase. Did you figure out why that happens? 
> > Greater latency with jumbo packets.
> > > 
> > > > The once piece which is still a mystery is why using four targets on
> > > > four separate interfaces striped with dmadm RAID0 does not produce an
> > > > aggregate of slightly less than four times the IOPS of a single target
> > > > on a single interface. This would not seem to be the out of order SCSI
> > > > command problem of multipath.  One of life's great mysteries yet to be
> > > > revealed.  Thanks again, all - John
> > > 
> > > Hmm.. maybe the out-of-order problem happens at the target? It gets IO
> > > requests to nearby offsets from 4 different sessions and there's some kind
> > > of locking or so going on? 
> > Ross pointed out a flaw in my test methodology.  By running one I/O at a
> > time, it was literally doing that - not one full RAID0 I/O but one disk
> > I/O apparently.  He said to truly test it, I would need to run as many
> > concurrent I/Os as there were disks in the array.  Thanks - John
> > ><snip>
> Argh!!! This turned out to be alarmingly untrue.  This time, we were
> doing some light testing on a different server with two bonded
> interfaces in a single bridge (KVM environment) going to the same SAN we
> used in our four port test.
> 
> For kicks and to prove to ourselves that RAID0 scaled with multiple I/O
> as opposed to limiting the test to only single I/O, we tried some actual
> file transfers to the SAN mounted in sync mode.  We found concurrently
> transferring two identical files to the RAID0 array composed of two
> iSCSI attached drives was 57% slower than concurrently transferring the
> files to the drives separately. In other words, copying file1 and file2
> concurrently to RAID0 took 57% longer than concurrently copying file1 to
> disk1 and file2 to disk2.
> 
> We then took a little different approach and used disktest.  We ran two
> concurrent sessions with -K1.  In one case, we ran both sessions to the
> 2 disk RAID0 array.  The performance was significantly less again, than
> running the two concurrent tests against two separate iSCSI disks.  Just
> to be clear, these were the same disks as composed the array, just not
> grouped in the array.
> 
> Even more alarmingly, we did the same test using multipath multibus,
> i.e., two concurrent disktest with -K1 (both reads and writes, all
> sequential with 4K block sizes).  The first session completely starved
> the second.  The first one continued at only slightly reduced speed
> while the second one (kicked off just as fast as we could hit the enter
> key) received only roughly 50 IOPS.  Yes, that's fifty.
> 
> Frightening but I thought I had better pass along such extreme results
> to the multipath team.  Thanks - John
HOLD THE PRESSES - This turned out to be a DIFFERENT problem.  Argh!
That's what I get for being a management type out of my depth doing
engineering until we hire our engineering staff!

As mentioned, these tests were run on a different, lighter-duty system.
When we ran the same tests on the larger server with four dedicated SAN
ports, RAID0 scaled nicely, showing little degradation between one
thread and four concurrent threads, i.e., our test file transfers took
almost the same amount of time when a single user ran them as when four
users ran them concurrently.

The problem with our other system was that the RAID (and probably the
multipath) configuration was backfiring because the iSCSI connections
were buckling under any appreciable load, and they were buckling because
the Ethernet interfaces are bridged.
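
For anyone trying to reproduce this, here is a rough sketch of how one
might confirm the bonding mode and bridge membership on hosts set up
this way (bond0 is an assumed name, not necessarily what a given
distribution uses):

  # which bonding mode is active and which slaves it holds
  cat /proc/net/bonding/bond0
  # which interfaces are enslaved to the KVM bridge
  brctl show
  # the iSCSI sessions riding over those same interfaces
  iscsiadm -m session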

These are much lighter-duty systems; we bought them from the same
vendor as the SAN but with only the two onboard Ethernet ports.  Being
ignorant, we looked to the vendor for design guidance (and they were
excellent in all other regards) and were not cautioned about sharing
these interfaces.  Because these are light-duty systems, we intentionally
broke the cardinal rule and did not give them a dedicated SAN network.
That's not so much the problem.  However, because they are running KVM,
the interfaces are bridged (actually bonded and bridged using
balance-tlb, as balance-alb breaks with bridging in its current
implementation - but bonding is not the issue).  Under any appreciable
load, the iSCSI connections time out.  We have tried varying the noop
timeout values but with no success.  We do not have the time to test
rigorously, but we assume this is why throughput did not scale at all:
disktest with -K10 achieved the same throughput as disktest with -K1.
Oh well, the price of tuition.
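
For reference, these are the open-iscsi settings involved when varying
the noop timeouts; a sketch only, with illustrative values rather than a
recommendation, and <target-iqn>/<portal> standing in for the real
records:

  # in /etc/iscsi/iscsid.conf (values illustrative):
  #   node.conn[0].timeo.noop_out_interval = 10
  #   node.conn[0].timeo.noop_out_timeout  = 30
  # or per node record, picked up on the next login:
  iscsiadm -m node -T <target-iqn> -p <portal> -o update \
    -n node.conn[0].timeo.noop_out_interval -v 10
  iscsiadm -m node -T <target-iqn> -p <portal> -o update \
    -n node.conn[0].timeo.noop_out_timeout -v 30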
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-27 18:28                                     ` John A. Sullivan III
@ 2009-03-29 18:09                                       ` Pasi Kärkkäinen
  2009-03-29 21:22                                         ` John A. Sullivan III
  0 siblings, 1 reply; 34+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-29 18:09 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Fri, Mar 27, 2009 at 02:28:45PM -0400, John A. Sullivan III wrote:
> On Fri, 2009-03-27 at 03:03 -0400, John A. Sullivan III wrote:
> > On Wed, 2009-03-25 at 12:21 -0400, John A. Sullivan III wrote:
> > > On Wed, 2009-03-25 at 17:52 +0200, Pasi Kärkkäinen wrote:
> > > > On Tue, Mar 24, 2009 at 11:41:00PM -0400, John A. Sullivan III wrote:
> > > > > > > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > > > > > > latency from initiator and target each, that would be roughly 200 micro
> > > > > > > seconds.  That would almost triple the throughput from what we are
> > > > > > > currently seeing.
> > > > > > > 
> > > > > > 
> > > > > > Indeed :) 
> > > > > > 
> > > > > > > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > > > > > > I can certainly learn but am I headed in the right direction or is this
> > > > > > > direction of investigation misguided? Thanks - John
> > > > > > > 
> > > > > > 
> > > > > > Low latency is the key for good (iSCSI) SAN performance, as it directly
> > > > > > gives you more (possible) IOPS. 
> > > > > > 
> > > > > > Other option is to configure software/settings so that there are multiple
> > > > > > outstanding IO's on the fly.. then you're not limited with the latency (so much).
> > > > > > 
> > > > > > -- Pasi
> > > > > <snip>
> > > > > Ross has been of enormous help offline.  Indeed, disabling jumbo packets
> > > > > produced an almost 50% increase in single threaded throughput.  We are
> > > > > pretty well set although still a bit disappointed in the latency we are
> > > > > seeing in opensolaris and have escalated to the vendor about addressing
> > > > > it.
> > > > > 
> > > > 
> > > > Ok. That's pretty big increase. Did you figure out why that happens? 
> > > Greater latency with jumbo packets.
> > > > 
> > > > > The one piece which is still a mystery is why using four targets on
> > > > > four separate interfaces striped with mdadm RAID0 does not produce an
> > > > > aggregate of slightly less than four times the IOPS of a single target
> > > > > on a single interface. This would not seem to be the out of order SCSI
> > > > > command problem of multipath.  One of life's great mysteries yet to be
> > > > > revealed.  Thanks again, all - John
> > > > 
> > > > Hmm.. maybe the out-of-order problem happens at the target? It gets IO
> > > > requests to nearby offsets from 4 different sessions and there's some kind
> > > > of locking or so going on? 
> > > Ross pointed out a flaw in my test methodology.  By running one I/O at a
> > > time, it was literally doing that - not one full RAID0 I/O but one disk
> > > I/O apparently.  He said to truly test it, I would need to run as many
> > > concurrent I/Os as there were disks in the array.  Thanks - John
> > > ><snip>
> > Argh!!! This turned out to be alarmingly untrue.  This time, we were
> > doing some light testing on a different server with two bonded
> > interfaces in a single bridge (KVM environment) going to the same SAN we
> > used in our four port test.
> > 
> > For kicks and to prove to ourselves that RAID0 scaled with multiple I/O
> > as opposed to limiting the test to only single I/O, we tried some actual
> > file transfers to the SAN mounted in sync mode.  We found concurrently
> > transferring two identical files to the RAID0 array composed of two
> > iSCSI attached drives was 57% slower than concurrently transferring the
> > files to the drives separately. In other words, copying file1 and file2
> > concurrently to RAID0 took 57% longer than concurrently copying file1 to
> > disk1 and file2 to disk2.
> > 
> > We then took a little different approach and used disktest.  We ran two
> > concurrent sessions with -K1.  In one case, we ran both sessions to the
> > 2 disk RAID0 array.  The performance was significantly less again, than
> > running the two concurrent tests against two separate iSCSI disks.  Just
> > to be clear, these were the same disks as composed the array, just not
> > grouped in the array.
> > 
> > Even more alarmingly, we did the same test using multipath multibus,
> > i.e., two concurrent disktest with -K1 (both reads and writes, all
> > sequential with 4K block sizes).  The first session completely starved
> > the second.  The first one continued at only slightly reduced speed
> > while the second one (kicked off just as fast as we could hit the enter
> > key) received only roughly 50 IOPS.  Yes, that's fifty.
> > 
> > Frightening but I thought I had better pass along such extreme results
> > to the multipath team.  Thanks - John
> HOLD THE PRESSES - This turned out to be a DIFFERENT problem.  Argh!
> That's what I get for being a management type out of my depth doing
> engineering until we hire our engineering staff!
> 
> As mentioned, these tests were run on a different, lighter duty system.
> When we ran the same tests on the larger, four dedicated SAN port
> server, RAID0 scaled nicely showing little degradation between one
> thread and four concurrent threads, i.e., our test file transfers took
> almost the same when a single user did them as opposed to when four
> users did them concurrently.
> 
> The problem with our other system was, the RAID (and probably
> multi-path) was backfiring because the iSCSI connection was buckling
> under any appreciable load because the Ethernet interfaces use bridging.
> 
> These are much lighter duty systems and we bought them from the same
> vendor as the SAN but with only the two onboard Ethernet ports.  Being
> ignorant, we looked to them for design guidance (and they were excellent
> in all other regards) and were not cautioned about sharing these
> interfaces.  Because these are light duty, we intentionally broke the
> cardinal rule of not using a dedicated SAN network for them.  That's not
> so much the problem. However, because they are running KVM, the
> interfaces are bridged (actually bonded and bridged using tlb as alb
> breaks with bridging in its current implementation - but bonding is not
> the issue).  Under any appreciable load, the iSCSI connections time out.
> We've tried varying the noop time out values but with no success.  We do
> not have the time to test rigorously but assume this is why throughput
> did not scale at all.  disktest with -K10 achieved the same throughput
> as disktest with -K1.  Oh well, the price of tuition.

Uhm, so there was virtualization in the mix.. I didn't realize that earlier..

Did you benchmark from the host or from the guest? 

So yeah.. the RAID-setup is working now, if I understood you correctly.. 
but the multipath setup is still problematic? 

-- Pasi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Shell Scripts or Arbitrary Priority Callouts?
  2009-03-29 18:09                                       ` Pasi Kärkkäinen
@ 2009-03-29 21:22                                         ` John A. Sullivan III
  0 siblings, 0 replies; 34+ messages in thread
From: John A. Sullivan III @ 2009-03-29 21:22 UTC (permalink / raw)
  To: device-mapper development; +Cc: Ross S. W. Walker

On Sun, 2009-03-29 at 21:09 +0300, Pasi Kärkkäinen wrote:
> On Fri, Mar 27, 2009 at 02:28:45PM -0400, John A. Sullivan III wrote:
> > On Fri, 2009-03-27 at 03:03 -0400, John A. Sullivan III wrote:
> > > On Wed, 2009-03-25 at 12:21 -0400, John A. Sullivan III wrote:
> > > > On Wed, 2009-03-25 at 17:52 +0200, Pasi Kärkkäinen wrote:
> > > > > On Tue, Mar 24, 2009 at 11:41:00PM -0400, John A. Sullivan III wrote:
> > > > > > > > Latency seems to be our key.  If I can add only 20 micro-seconds of
> > > > > > > > latency from initiator and target each, that would be roughly 200 micro
> > > > > > > > seconds.  That would almost triple the throughput from what we are
> > > > > > > > currently seeing.
> > > > > > > > 
> > > > > > > 
> > > > > > > Indeed :) 
> > > > > > > 
> > > > > > > > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > > > > > > > I can certainly learn but am I headed in the right direction or is this
> > > > > > > > direction of investigation misguided? Thanks - John
> > > > > > > > 
> > > > > > > 
> > > > > > > Low latency is the key for good (iSCSI) SAN performance, as it directly
> > > > > > > gives you more (possible) IOPS. 
> > > > > > > 
> > > > > > > Other option is to configure software/settings so that there are multiple
> > > > > > > outstanding IO's on the fly.. then you're not limited with the latency (so much).
> > > > > > > 
> > > > > > > -- Pasi
> > > > > > <snip>
> > > > > > Ross has been of enormous help offline.  Indeed, disabling jumbo packets
> > > > > > produced an almost 50% increase in single threaded throughput.  We are
> > > > > > pretty well set although still a bit disappointed in the latency we are
> > > > > > seeing in opensolaris and have escalated to the vendor about addressing
> > > > > > it.
> > > > > > 
> > > > > 
> > > > > Ok. That's pretty big increase. Did you figure out why that happens? 
> > > > Greater latency with jumbo packets.
> > > > > 
> > > > > > The one piece which is still a mystery is why using four targets on
> > > > > > four separate interfaces striped with mdadm RAID0 does not produce an
> > > > > > aggregate of slightly less than four times the IOPS of a single target
> > > > > > on a single interface. This would not seem to be the out of order SCSI
> > > > > > command problem of multipath.  One of life's great mysteries yet to be
> > > > > > revealed.  Thanks again, all - John
> > > > > 
> > > > > Hmm.. maybe the out-of-order problem happens at the target? It gets IO
> > > > > requests to nearby offsets from 4 different sessions and there's some kind
> > > > > of locking or so going on? 
> > > > Ross pointed out a flaw in my test methodology.  By running one I/O at a
> > > > time, it was literally doing that - not one full RAID0 I/O but one disk
> > > > I/O apparently.  He said to truly test it, I would need to run as many
> > > > concurrent I/Os as there were disks in the array.  Thanks - John
> > > > ><snip>
> > > Argh!!! This turned out to be alarmingly untrue.  This time, we were
> > > doing some light testing on a different server with two bonded
> > > interfaces in a single bridge (KVM environment) going to the same SAN we
> > > used in our four port test.
> > > 
> > > For kicks and to prove to ourselves that RAID0 scaled with multiple I/O
> > > as opposed to limiting the test to only single I/O, we tried some actual
> > > file transfers to the SAN mounted in sync mode.  We found concurrently
> > > transferring two identical files to the RAID0 array composed of two
> > > iSCSI attached drives was 57% slower than concurrently transferring the
> > > files to the drives separately. In other words, copying file1 and file2
> > > concurrently to RAID0 took 57% longer than concurrently copying file1 to
> > > disk1 and file2 to disk2.
> > > 
> > > We then took a little different approach and used disktest.  We ran two
> > > concurrent sessions with -K1.  In one case, we ran both sessions to the
> > > 2 disk RAID0 array.  The performance was significantly less again, than
> > > running the two concurrent tests against two separate iSCSI disks.  Just
> > > to be clear, these were the same disks as composed the array, just not
> > > grouped in the array.
> > > 
> > > Even more alarmingly, we did the same test using multipath multibus,
> > > i.e., two concurrent disktest with -K1 (both reads and writes, all
> > > sequential with 4K block sizes).  The first session completely starved
> > > the second.  The first one continued at only slightly reduced speed
> > > while the second one (kicked off just as fast as we could hit the enter
> > > key) received only roughly 50 IOPS.  Yes, that's fifty.
> > > 
> > > Frightening but I thought I had better pass along such extreme results
> > > to the multipath team.  Thanks - John
> > HOLD THE PRESSES - This turned out to be a DIFFERENT problem.  Argh!
> > That's what I get for being a management type out of my depth doing
> > engineering until we hire our engineering staff!
> > 
> > As mentioned, these tests were run on a different, lighter duty system.
> > When we ran the same tests on the larger, four dedicated SAN port
> > server, RAID0 scaled nicely showing little degradation between one
> > thread and four concurrent threads, i.e., our test file transfers took
> > almost the same when a single user did them as opposed to when four
> > users did them concurrently.
> > 
> > The problem with our other system was, the RAID (and probably
> > multi-path) was backfiring because the iSCSI connection was buckling
> > under any appreciable load because the Ethernet interfaces use bridging.
> > 
> > These are much lighter duty systems and we bought them from the same
> > vendor as the SAN but with only the two onboard Ethernet ports.  Being
> > ignorant, we looked to them for design guidance (and they were excellent
> > in all other regards) and were not cautioned about sharing these
> > interfaces.  Because these are light duty, we intentionally broke the
> > cardinal rule of not using a dedicated SAN network for them.  That's not
> > so much the problem. However, because they are running KVM, the
> > interfaces are bridged (actually bonded and bridged using tlb as alb
> > breaks with bridging in its current implementation - but bonding is not
> > the issue).  Under any appreciable load, the iSCSI connections time out.
> > We've tried varying the noop time out values but with no success.  We do
> > not have the time to test rigorously but assume this is why throughput
> > did not scale at all.  disktest with -K10 achieved the same throughput
> > as disktest with -K1.  Oh well, the price of tuition.
> 
> Uhm, so there was virtualization in the mix.. I didn't realize that earlier..
> 
> Did you benchmark from the host or from the guest? 
> 
> So yeah.. the RAID-setup is working now, if I understood you correctly.. 
> but the multipath setup is still problematic? 
<snip>
I wish I had more time to pursue this, but we lost an unexpected five
weeks on this on a three-month project.  It would seem the RAID0 is
scaling, although we haven't absolutely verified it.  Multipath does as
well.  We did find wide variations in how well it scales based upon the
combination of rr_min_io and the maximum number of concurrent
transactions.  As I believe would be expected, a lower rr_min_io (as in
1) performed much better under few concurrent threads (e.g., 10) but
started to lose its edge under heavier concurrency (e.g., 100 threads).
Thus my ignorant guess is that for low-density virtualization the former
works better, and for high-density setups (such as our desktop
environment) a higher value works better.  Of course, even those
definitions vary widely depending on the nature of the guests.

RAID0 was more even, being just slightly slower than rr_min_io=1 under
light loads and just slightly slower than rr_min_io=10 under moderate
and heavy loads.
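
In case it helps anyone reproduce the comparison, the knob in question
lives in /etc/multipath.conf.  A minimal sketch only, assuming the
multibus grouping mentioned earlier; the values are simply the two
points we compared, not a recommendation:

  defaults {
          path_grouping_policy    multibus
          # rr_min_io 1 favored ~10 concurrent threads,
          # rr_min_io 10 favored ~100 concurrent threads
          rr_min_io               1
  }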

Tests were run from the virtualization host.  Thanks for all your help -
John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2009-03-29 21:22 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-18 22:57 Shell Scripts or Arbitrary Priority Callouts? Christopher Chen
2009-03-19 11:04 ` John A. Sullivan III
2009-03-20  5:11   ` Christopher Chen
2009-03-20 10:01     ` John A. Sullivan III
2009-03-22 15:27       ` Pasi Kärkkäinen
2009-03-22 16:50         ` John A. Sullivan III
2009-03-23  4:42         ` Christopher Chen
2009-03-23  9:46         ` John A. Sullivan III
     [not found]           ` <CF307021-DE23-4BB1-BC6D-F4F520464208@medallion.com>
2009-03-23 13:07             ` John A. Sullivan III
2009-03-24  7:39           ` Pasi Kärkkäinen
2009-03-24 11:02             ` John A. Sullivan III
2009-03-24 11:57               ` Pasi Kärkkäinen
2009-03-24 12:21                 ` John A. Sullivan III
2009-03-24 15:01                   ` Pasi Kärkkäinen
2009-03-24 15:09                     ` Pasi Kärkkäinen
2009-03-24 15:43                     ` John A. Sullivan III
2009-03-24 16:36                       ` Pasi Kärkkäinen
2009-03-24 17:30                         ` John A. Sullivan III
2009-03-24 18:17                           ` Pasi Kärkkäinen
2009-03-25  3:41                             ` John A. Sullivan III
2009-03-25 15:52                               ` Pasi Kärkkäinen
2009-03-25 16:21                                 ` John A. Sullivan III
2009-03-27  7:03                                   ` John A. Sullivan III
2009-03-27  9:02                                     ` Pasi Kärkkäinen
2009-03-27  9:14                                       ` John A. Sullivan III
     [not found]                                     ` <2A4646CA-45AF-460A-8E43-A6D9C070842A@medallion.com>
     [not found]                                       ` <1238165063.6574.97.camel@jaspav.missionsit.net.missionsit.net>
     [not found]                                         ` <E2BB8074E5500C42984D980D4BD78EF9029E3CFA@MFG-NYC-EXCH2.mfg.prv>
2009-03-27 17:15                                           ` John A. Sullivan III
2009-03-27 18:06                                             ` John A. Sullivan III
2009-03-27 18:23                                               ` Ross S. W. Walker
2009-03-27 18:28                                     ` John A. Sullivan III
2009-03-29 18:09                                       ` Pasi Kärkkäinen
2009-03-29 21:22                                         ` John A. Sullivan III
2009-03-25  3:44                             ` John A. Sullivan III
2009-03-25 15:52                               ` Pasi Kärkkäinen
2009-03-25 16:19                                 ` John A. Sullivan III
