* IPoIB performance
From: Atchley, Scott @ 2012-08-29 19:35 UTC
  To: linux-rdma@vger.kernel.org

Hi all,

I am benchmarking a sockets-based application and I want a sanity check on IPoIB performance expectations when using connected mode (65520 MTU). I am using the tuning tips in Documentation/infiniband/ipoib.txt. The machines have Mellanox QDR cards (see below for the verbose ibv_devinfo output). I am using a 2.6.36 kernel. The hosts have a single-socket Intel E5520 (4 cores with hyper-threading on) at 2.27 GHz.

I am using netperf's TCP_STREAM and binding cores. The best I have seen is ~13 Gbps. Is this the best I can expect from these cards?
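
For reference, this is roughly the setup and invocation being used; the
interface name (ib0), peer address, and core numbers below are assumptions
rather than the exact values from these runs:

# per Documentation/infiniband/ipoib.txt: connected mode and the maximum MTU
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520
# on the server
netserver
# on the client: bind local and remote netperf to core 0, 30-second TCP_STREAM
netperf -H 192.168.0.2 -T 0,0 -t TCP_STREAM -l 30 -- -m 65536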

What should I expect as a max for ipoib with FDR cards?

Thanks,

Scott



hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.626
        node_guid:                      0002:c903:000b:6520
        sys_image_guid:                 0002:c903:000b:6523
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0D90110009
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         65464
        max_qp_wr:                      16384
        device_cap_flags:               0x006c9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         131056
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                1047424
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 6
                        port_lid:               8
                        port_lmc:               0x00
                        link_layer:             InfiniBand
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         18
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           10.0 Gbps (4)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:0002:c903:000b:6521


* Re: IPoIB performance
From: Christoph Lameter @ 2012-09-05 15:51 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On Wed, 29 Aug 2012, Atchley, Scott wrote:

> I am benchmarking a sockets based application and I want a sanity check
> on IPoIB performance expectations when using connected mode (65520 MTU).
> I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
> machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
> output). I am using a 2.6.36 kernel. The hosts have single socket Intel
> E5520 (4 core with hyper-threading on) at 2.27 GHz.
>
> I am using netperf's TCP_STREAM and binding cores. The best I have seen
> is ~13 Gbps. Is this the best I can expect from these cards?

Sounds about right. This is not a hardware limitation but
a limitation of the socket I/O layer / PCI-E bus. The cards can generally
process more data than the PCI bus and the OS can handle.

PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
NICs. So there is likely something that the network layer does to you that
limits the bandwidth.

> What should I expect as a max for ipoib with FDR cards?

More of the same. You may want to

A) increase the block size handled by the socket layer

B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.

C) Bypass the socket layer. Look at Sean's rsockets layer f.e.
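
A minimal sketch of option C) using the rsockets preload library from
librdmacm (the librspreload.so path varies by distribution and is an
assumption here; the rdma_cm/rdma_ucm modules must be loaded):

# run an unmodified sockets benchmark over rsockets
LD_PRELOAD=/usr/lib64/rsocket/librspreload.so netperf -H 192.168.0.2 -t TCP_STREAM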

* Re: IPoIB performance
From: Atchley, Scott @ 2012-09-05 17:09 UTC
  To: Christoph Lameter; +Cc: linux-rdma@vger.kernel.org

On Sep 5, 2012, at 11:51 AM, Christoph Lameter wrote:

> On Wed, 29 Aug 2012, Atchley, Scott wrote:
> 
>> I am benchmarking a sockets based application and I want a sanity check
>> on IPoIB performance expectations when using connected mode (65520 MTU).
>> I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
>> machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
>> output). I am using a 2.6.36 kernel. The hosts have single socket Intel
>> E5520 (4 core with hyper-threading on) at 2.27 GHz.
>> 
>> I am using netperf's TCP_STREAM and binding cores. The best I have seen
>> is ~13 Gbps. Is this the best I can expect from these cards?
> 
> Sounds about right, This is not a hardware limitation but
> a limitation of the socket I/O layer / PCI-E bus. The cards generally can
> process more data than the PCI bus and the OS can handle.
> 
> PCI-E on PCI 2.0 should give you up to about 2.3 Gbytes/sec with these
> nics. So there is like something that the network layer does to you that
> limits the bandwidth.

First, thanks for the reply.

I am not sure where you are getting the 2.3 GB/s value. When using verbs natively, I can get ~3.4 GB/s. I am assuming that these HCAs lack certain TCP offloads that might allow higher socket performance. Ethtool reports:

# ethtool -k ib0
Offload parameters for ib0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: off

There is no checksum offload support, which I would expect to lower performance. Since checksums need to be calculated on the host, I would expect faster processors to help performance somewhat.
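
As a sanity check on the checksum cost, one could try toggling the offloads
and measuring CPU utilization alongside throughput; a minimal sketch
(whether the ipoib driver accepts these ethtool flags is an assumption):

# attempt to enable the stateless offloads (the driver may reject these)
ethtool -K ib0 rx on tx on sg on tso on gso on gro on
# report local (-c) and remote (-C) CPU utilization with the throughput
netperf -H 192.168.0.2 -T 0,0 -c -C -t TCP_STREAM -l 30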

So basically, am I in the ball park given this hardware?

> 
>> What should I expect as a max for ipoib with FDR cards?
> 
> More of the same. You may want to
> 
> A) increase the block size handled by the socket layer

Do you mean altering sysctl with something like:

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216 
net.core.wmem_max = 16777216 
# increase Linux autotuning TCP buffer limit 
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# increase the length of the processor input queue
net.core.netdev_max_backlog = 30000

or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?
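
For concreteness, a sketch of both interpretations, using the values from
the block above (the netperf socket-buffer options are an assumed way to
exercise SO_SNDBUF/SO_RCVBUF; explicit sizes disable Linux autotuning):

# apply the sysctl settings shown above
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.core.netdev_max_backlog=30000
# or request explicit send/receive socket buffers for a single run
netperf -H 192.168.0.2 -t TCP_STREAM -- -s 4194304 -S 4194304 -m 65536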

> B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.
> 
> C) Bypass the socket layer. Look at Sean's rsockets layer f.e.

We actually want to test the socket stack and not bypass it.

Thanks again!

Scott


* Re: IPoIB performance
From: Reeted @ 2012-09-05 17:50 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On 08/29/12 21:35, Atchley, Scott wrote:
> Hi all,
>
> I am benchmarking a sockets based application and I want a sanity check on IPoIB performance expectations when using connected mode (65520 MTU).....

I have read that with newer cards the datagram (unconnected) mode is
faster for IPoIB than connected mode. Do you want to check?
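
Switching modes is quick to try; a minimal sketch, assuming the interface
is ib0 on a 2048-byte IB MTU fabric (datagram caps the IP MTU at 2044):

# datagram mode (may expose more stateless offloads on newer HCAs)
echo datagram > /sys/class/net/ib0/mode
ip link set ib0 mtu 2044
ethtool -k ib0
# back to connected mode for the large-MTU runs
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520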

What benchmark program are you using?

* Re: IPoIB performance
From: Reeted @ 2012-09-05 17:52 UTC
  To: Christoph Lameter; +Cc: Atchley, Scott, linux-rdma@vger.kernel.org

On 09/05/12 17:51, Christoph Lameter wrote:
> PCI-E on PCI 2.0 should give you up to about 2.3 Gbytes/sec with these
> nics. So there is like something that the network layer does to you that
> limits the bandwidth.

I think those are 8-lane PCI-E 2.0, so that would be 500 MB/sec x 8, which
is 4 GBytes/sec. Or do you really mean there is almost 50% overhead?

* Re: IPoIB performance
From: Atchley, Scott @ 2012-09-05 17:59 UTC
  To: Reeted; +Cc: linux-rdma@vger.kernel.org

On Sep 5, 2012, at 1:50 PM, Reeted wrote:

> On 08/29/12 21:35, Atchley, Scott wrote:
>> Hi all,
>> 
>> I am benchmarking a sockets based application and I want a sanity check on IPoIB performance expectations when using connected mode (65520 MTU).....
> 
> I have read that with newer cards the datagram (unconnected) mode is 
> faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but that the bandwidth is also lower.

Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on these machines/cards. Connected mode at the same MTU performs roughly the same. The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get ~13 Gb/s.
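
A sketch of the kind of MTU sweep behind those numbers (interface name and
peer address are assumptions):

for mtu in 2044 9000 65520; do
    ip link set ib0 mtu $mtu
    netperf -H 192.168.0.2 -T 0,0 -t TCP_STREAM -l 30 -- -m 65536
done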

> What benchmark program are you using?

netperf with process binding (-T). I tune sysctl per the DOE FasterData specs:

http://fasterdata.es.net/host-tuning/linux/

Scott

* Re: IPoIB performance
From: Christoph Lameter @ 2012-09-05 18:20 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On Wed, 5 Sep 2012, Atchley, Scott wrote:

> # ethtool -k ib0
> Offload parameters for ib0:
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: on
> generic-receive-offload: off
>
> There is no checksum support which I would expect to lower performance.
> Since checksums need to be calculated in the host, I would expect faster
> processors to help performance some.

OK, that is a major problem. Both are on by default here. What NIC is this?

> > A) increase the block size handled by the socket layer
>
> Do you mean altering sysctl with something like:

Nope, increase the MTU. Connected mode supports up to a 64K MTU size, I believe.

> or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?

That does nothing for performance. The problem is that the handling of the
data by the kernel causes too much latency, so you cannot reach the
full bandwidth of the hardware.

> We actually want to test the socket stack and not bypass it.

AFAICT the network stack is useful up to 1Gbps and
after that more and more band-aid comes into play.

* Re: IPoIB performance
From: Atchley, Scott @ 2012-09-05 18:30 UTC
  To: Christoph Lameter; +Cc: linux-rdma@vger.kernel.org

On Sep 5, 2012, at 2:20 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
> 
>> # ethtool -k ib0
>> Offload parameters for ib0:
>> rx-checksumming: off
>> tx-checksumming: off
>> scatter-gather: off
>> tcp segmentation offload: off
>> udp fragmentation offload: off
>> generic segmentation offload: on
>> generic-receive-offload: off
>> 
>> There is no checksum support which I would expect to lower performance.
>> Since checksums need to be calculated in the host, I would expect faster
>> processors to help performance some.
> 
> K that is a major problem. Both are on by default here. What NIC is this?

These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of ibv_devinfo is in my original post.

>>> A) increase the block size handled by the socket layer
>> 
>> Do you mean altering sysctl with something like:
> 
> Nope increase mtu. Connected mode supports up to 64k mtu size I believe.

Yes, I am using the max MTU (65520).

>> or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?
> 
> That does nothing for performance. The problem is that the handling of the
> data by the kernel causes too much latency so that you cannot reach the
> full bw of the hardware.
> 
>> We actually want to test the socket stack and not bypass it.
> 
> AFAICT the network stack is useful up to 1Gbps and
> after that more and more band-aid comes into play.

Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G Ethernet NICs, but I hope that they will get close to line rate. If not, what is the point? ;-)

Scott

* Re: IPoIB performance
From: Reeted @ 2012-09-05 19:04 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On 09/05/12 19:59, Atchley, Scott wrote:
> On Sep 5, 2012, at 1:50 PM, Reeted wrote:
>
>>
>> I have read that with newer cards the datagram (unconnected) mode is
>> faster at IPoIB than connected mode. Do you want to check?
> I have read that the latency is lower (better) but the bandwidth is lower.
>
> Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on these machines/cards. Connected mode at the same MTU performs roughly the same. The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get ~13 Gb/s.
>

Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to
Ethernet routing performance". He had numbers much higher than yours on
similar hardware, and it was suggested he use datagram mode to get
offloading and even higher speeds. Keep me informed if you can fix this;
I am interested but can't test InfiniBand myself right now.

* Re: IPoIB performance
From: Christoph Lameter @ 2012-09-05 19:06 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On Wed, 5 Sep 2012, Atchley, Scott wrote:

> > AFAICT the network stack is useful up to 1Gbps and
> > after that more and more band-aid comes into play.
>
> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G Ethernet NICs, but I hope that they will get close to line rate. If not, what is the point? ;-)

Oh yes they can under restricted circumstances. Large packets, multiple
cores etc. With the band-aids....


* Re: IPoIB performance
From: Christoph Lameter @ 2012-09-05 19:13 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On Wed, 5 Sep 2012, Atchley, Scott wrote:

> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of ibv_devinfo is in my original post.

Hmmm... You are running an old kernel. What version of OFED do you use?



* Re: IPoIB performance
From: Atchley, Scott @ 2012-09-05 19:46 UTC
  To: Reeted; +Cc: linux-rdma@vger.kernel.org

On Sep 5, 2012, at 3:04 PM, Reeted wrote:

> On 09/05/12 19:59, Atchley, Scott wrote:
>> On Sep 5, 2012, at 1:50 PM, Reeted wrote:
>> 
>>> 
>>> I have read that with newer cards the datagram (unconnected) mode is
>>> faster at IPoIB than connected mode. Do you want to check?
>> I have read that the latency is lower (better) but the bandwidth is lower.
>> 
>> Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on these machines/cards. Connected mode at the same MTU performs roughly the same. The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get ~13 Gb/s.
>> 
> 
> Have a look at an old thread in this ML by Sebastien Dugue "IPoIB to 
> Ethernet routing performance"
> He had numbers much higher than yours on similar hardware, and was 
> suggested to use datagram to achieve offloading and even higher speeds.
> Keep me informed if you can fix this, I am interested but can't test 
> infiniband myself right now.

He claims 20 Gb/s and Or replies that one should also get near 20 Gb/s using datagram mode. I checked and datagram mode shows support via ethtool for more offloads. In my case, I still see better performance with connected mode.

Thanks,

Scott

* Re: IPoIB performance
From: Atchley, Scott @ 2012-09-05 19:48 UTC
  To: Christoph Lameter; +Cc: linux-rdma@vger.kernel.org

On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
> 
>>> AFAICT the network stack is useful up to 1Gbps and
>>> after that more and more band-aid comes into play.
>> 
>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G Ethernet NICs, but I hope that they will get close to line rate. If not, what is the point? ;-)
> 
> Oh yes they can under restricted circumstances. Large packets, multiple
> cores etc. With the band-aids….

With Myricom 10G NICs, for example, you just need one core and it can do line rate with 1500 byte MTU. Do you count the stateless offloads as band-aids? Or something else?

I have not tested any 40G NICs yet, but I imagine that one core will not be enough.

Thanks,

Scott

* Re: IPoIB performance
From: Atchley, Scott @ 2012-09-05 19:52 UTC
  To: Christoph Lameter; +Cc: linux-rdma@vger.kernel.org


On Sep 5, 2012, at 3:13 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
> 
>> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of ibv_devinfo is in my original post.
> 
> Hmmm... You are running an old kernel. What version of OFED do you use?

Hah, if you think my kernel is old, you should see my userland (RHEL5.5). ;-)

Does the version of OFED impact the kernel modules? I am using the modules that came with the kernel. I don't believe that libibverbs or librdmacm are used by the kernel's socket stack. That said, I am using source builds with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).
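
A quick way to see which in-kernel drivers are loaded versus a packaged
OFED install (ofed_info exists only if a packaged OFED is installed; that
check is an assumption here):

modinfo ib_ipoib mlx4_ib | grep -E '^(filename|version|vermagic)'
ofed_info -s 2>/dev/null || echo "no packaged OFED install"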

Scott

* Re: IPoIB performance
From: Christoph Lameter @ 2012-09-05 19:53 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On Wed, 5 Sep 2012, Atchley, Scott wrote:

> With Myricom 10G NICs, for example, you just need one core and it can do
> line rate with 1500 byte MTU. Do you count the stateless offloads as
> band-aids? Or something else?

The stateless aids also have certain limitations. It's a grey zone whether
you want to call them band-aids. It gets there at some point, because
stateless offload can only get you so far. The need to send larger-sized
packets through the kernel increases the latency and forces the app to do
larger batching. It's not very useful if you need to send small packets to
a variety of receivers.


* Re: IPoIB performance
From: Ezra Kissel @ 2012-09-05 20:12 UTC
  To: Atchley, Scott; +Cc: Christoph Lameter, linux-rdma@vger.kernel.org

On 9/5/2012 3:48 PM, Atchley, Scott wrote:
> On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:
>
>> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>>
>>>> AFAICT the network stack is useful up to 1Gbps and
>>>> after that more and more band-aid comes into play.
>>>
>>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G Ethernet NICs, but I hope that they will get close to line rate. If not, what is the point? ;-)
>>
>> Oh yes they can under restricted circumstances. Large packets, multiple
>> cores etc. With the band-aids….
>
> With Myricom 10G NICs, for example, you just need one core and it can do line rate with 1500 byte MTU. Do you count the stateless offloads as band-aids? Or something else?
>
> I have not tested any 40G NICs yet, but I imagine that one core will not be enough.
>
Since you are using netperf, you might also consider experimenting
with the TCP_SENDFILE test.  Using sendfile/splice calls can have a
significant impact for sockets-based apps.
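
A sketch of that test, reusing the same core binding as the TCP_STREAM runs
(file path and peer address are placeholders):

# create a file for netperf to transmit with sendfile()
dd if=/dev/zero of=/tmp/netperf.dat bs=1M count=1024
netperf -H 192.168.0.2 -T 0,0 -t TCP_SENDFILE -l 30 -- -F /tmp/netperf.dat -m 65536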

Using 40G NICs (Mellanox ConnectX-3 EN), I've seen our applications hit 
22Gb/s single core/stream while fully CPU bound.  With sendfile/splice, 
there is no issue saturating a 40G link with about 40-50% core 
utilization.  That being said, binding to the right core/node, message 
size and memory alignment, interrupt handling, and proper host/NIC 
tuning all have an impact on the performance.  The state of 
high-performance networking is certainly not plug-and-play.
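
A sketch of the locality checks implied here (interface name, NUMA node
number, and IRQ naming are assumptions):

# which NUMA node the NIC hangs off of (-1 means no locality information)
cat /sys/class/net/eth2/device/numa_node
# run the benchmark on that node's cores and memory
numactl --cpunodebind=0 --membind=0 netperf -H 192.168.0.2 -t TCP_STREAM -l 30
# see where the NIC's interrupts are being delivered
grep eth2 /proc/interrupts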

- ezra

* Re: IPoIB performance
From: Christoph Lameter @ 2012-09-05 20:26 UTC
  To: Atchley, Scott; +Cc: linux-rdma@vger.kernel.org

On Wed, 5 Sep 2012, Atchley, Scott wrote:

> > Hmmm... You are running an old kernel. What version of OFED do you
> > use?
>
> Hah, if you think my kernel is old, you should see my userland
> (RHEL5.5). ;-)

My condolences.

> Does the version of OFED impact the kernel modules? I am using the
> modules that came with the kernel. I don't believe that libibverbs or
> librdmacm are used by the kernel's socket stack. That said, I am using
> source builds with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

OFED includes kernel modules which provide the drivers that you need.
Installing a new OFED release on RHEL 5 is possible and would give you
up-to-date drivers. Check with Red Hat: they may have them somewhere easy
to install for your version of RHEL.


* Re: IPoIB performance
From: Atchley, Scott @ 2012-09-05 20:32 UTC
  To: Ezra Kissel; +Cc: Christoph Lameter, linux-rdma@vger.kernel.org

On Sep 5, 2012, at 4:12 PM, Ezra Kissel wrote:

> On 9/5/2012 3:48 PM, Atchley, Scott wrote:
>> On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:
>> 
>>> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>>> 
>>>>> AFAICT the network stack is useful up to 1Gbps and
>>>>> after that more and more band-aid comes into play.
>>>> 
>>>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G Ethernet NICs, but I hope that they will get close to line rate. If not, what is the point? ;-)
>>> 
>>> Oh yes they can under restricted circumstances. Large packets, multiple
>>> cores etc. With the band-aids….
>> 
>> With Myricom 10G NICs, for example, you just need one core and it can do line rate with 1500 byte MTU. Do you count the stateless offloads as band-aids? Or something else?
>> 
>> I have not tested any 40G NICs yet, but I imagine that one core will not be enough.
>> 
> Since you are using netperf, you might also considering experimenting 
> with the TCP_SENDFILE test.  Using sendfile/splice calls can have a 
> significant impact for sockets-based apps.
> 
> Using 40G NICs (Mellanox ConnectX-3 EN), I've seen our applications hit 
> 22Gb/s single core/stream while fully CPU bound.  With sendfile/splice, 
> there is no issue saturating a 40G link with about 40-50% core 
> utilization.  That being said, binding to the right core/node, message 
> size and memory alignment, interrupt handling, and proper host/NIC 
> tuning all have an impact on the performance.  The state of 
> high-performance networking is certainly not plug-and-play.

Thanks for the tip. The app we want to test does not use sendfile() or splice().

I do bind to the "best" core (determined by testing all combinations on client and server).

I have heard others within DOE reach ~16 Gb/s on a 40G Mellanox NIC. I'm glad to hear that you got to 22 Gb/s for a single stream. That is more reassuring.

Scott
