* Bonding, GRO and tcp_reordering
From: Simon Horman @ 2010-11-30 13:55 UTC
  To: netdev

Hi,

I just wanted to share what is a rather pleasing,
though to me somewhat surprising result.

I am testing bonding using balance-rr mode with three physical links to try
to get > gigabit speed for a single stream. Why?  Because I'd like to run
various tests at > gigabit speed and I don't have any 10G hardware at my
disposal.
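
For reference, the bond was brought up along roughly these lines (a sketch only;
the interface names, miimon value and local address here are placeholders rather
than my exact configuration), together with the offload and tcp_reordering
settings referred to below:

modprobe bonding mode=balance-rr miimon=100   # loading the module creates bond0 by default
ip addr add 172.17.60.215/24 dev bond0        # example local address
ip link set dev bond0 up
ifenslave bond0 eth0 eth1 eth2                # the three physical links
ethtool -K eth0 tso off gso off               # likewise eth1/eth2, on both hosts, for the first test
sysctl -w net.ipv4.tcp_reordering=3           # the kernel default, shown for completeness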

The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
LSO and GSO disabled on both the sender and receiver I see:

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
(172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982  -1.000

But with GRO enabled on the receiver I see.

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
(172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   1472    10.01      2613.83   19.32    -1.00    1.211   -1.000

Which is much better than any result I get tweaking tcp_reordering when
GRO is disabled on the receiver.

Tweaking tcp_reordering when GRO is enabled on the receiver seems to have
negligible effect.  Which is interesting, because my brief reading on the
subject indicated that tcp_reordering was the key tuning parameter for
bonding with balance-rr.

The only other parameter that seemed to have significant effect was to
increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
impact on throughput, though a significant positive effect on CPU
utilisation.
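
For the MTU=9000 runs the jumbo MTU was set along these lines (again, device
names are placeholders, and setting it on the bond should also adjust the slaves):

ip link set dev bond0 mtu 9000                # on both hosts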

MTU=9000, sender,receiver:tcp_reordering=3(default), receiver:GRO=off
netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 9872
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   9872    10.01      2957.52   14.89    -1.00    0.825   -1.000

MTU=9000, sender,receiver:tcp_reordering=3(default), receiver:GRO=on
netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 9872
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   9872    10.01      2847.64   10.84    -1.00    0.624   -1.000


Test run using 2.6.37-rc1


* Re: Bonding, GRO and tcp_reordering
From: Ben Hutchings @ 2010-11-30 15:42 UTC
  To: Simon Horman; +Cc: netdev

On Tue, 2010-11-30 at 22:55 +0900, Simon Horman wrote:
> Hi,
> 
> I just wanted to share what is a rather pleasing,
> though to me somewhat surprising result.
>
> I am testing bonding using balance-rr mode with three physical links to try
> to get > gigabit speed for a single stream. Why?  Because I'd like to run
> various tests at > gigabit speed and I don't have any 10G hardware at my
> disposal.
> 
> The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
> LSO and GSO disabled on both the sender and receiver I see:
> 
> # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>   87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982  -1.000
> 
> But with GRO enabled on the receiver I see.
> 
> # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384   1472    10.01      2613.83   19.32    -1.00    1.211   -1.000
> 
> Which is much better than any result I get tweaking tcp_reordering when
> GRO is disabled on the receiver.

Did you also enable TSO/GSO on the sender?

What TSO/GSO will do is to change the round-robin scheduling from one
packet per interface to one super-packet per interface.  GRO then
coalesces the physical packets back into a super-packet.  The intervals
between receiving super-packets then tend to exceed the difference in
delay between interfaces, hiding the reordering.

If you only enabled GRO then I don't understand this.
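
Either way, it is worth double-checking what each side actually has enabled;
for example (device names here are only illustrative):

ethtool -K eth0 tso on gso on    # on the sender, for each slave
ethtool -k eth0                  # on both sides; look at tcp-segmentation-offload,
                                 # generic-segmentation-offload and generic-receive-offload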

> Tweaking tcp_reordering when GRO is enabled on the receiver seems to have
> negligible effect.  Which is interesting, because my brief reading on the
> subject indicated that tcp_reordering was the key tuning parameter for
> bonding with balance-rr.
> 
> The only other parameter that seemed to have significant effect was to
> increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> impact on throughput, though a significant positive effect on CPU
> utilisation.
[...]

Increasing MTU also increases the interval between packets on a TCP flow
using maximum segment size so that it is more likely to exceed the
difference in delay.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: Bonding, GRO and tcp_reordering
From: Eric Dumazet @ 2010-11-30 16:04 UTC
  To: Ben Hutchings; +Cc: Simon Horman, netdev

On Tuesday 30 November 2010 at 15:42 +0000, Ben Hutchings wrote:
> On Tue, 2010-11-30 at 22:55 +0900, Simon Horman wrote:

> > The only other parameter that seemed to have significant effect was to
> > increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> > impact on throughput, though a significant positive effect on CPU
> > utilisation.
> [...]
> 
> Increasing MTU also increases the interval between packets on a TCP flow
> using maximum segment size so that it is more likely to exceed the
> difference in delay.
> 

GRO really only kicks in _if_ we receive several packets for the same
flow in the same NAPI run.

As soon as we exit NAPI mode, GRO packets are flushed.

Big MTU --> bigger delays between packets, so a good chance that GRO cannot
trigger at all, since each NAPI run handles only one packet.

One possibility with a big MTU is to tweak the "ethtool -c eth0" params
rx-usecs: 20
rx-frames: 5
rx-usecs-irq: 0
rx-frames-irq: 5
so that "rx-usecs" is bigger than the delay between two full MTU-sized
packets.

Gigabit speed means 1 nanosecond per bit, so MTU=9000 means a 72 us
delay between packets.

So try:

ethtool -C eth0 rx-usecs 100

to increase the chance that several packets are delivered at once by the NIC.

Unfortunately, this also adds some latency, so it helps bulk transfers
and slows down interactive traffic.
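
As a quick sanity check of those numbers (the device name is only an example):

mtu=9000; rate_mbit=1000
echo "gap: $((mtu * 8 / rate_mbit)) us"    # 9000*8/1000 = 72 us per full-sized frame at 1 Gbit/s
ethtool -C eth0 rx-usecs 100               # comfortably above the 72 us gap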




* Re: Bonding, GRO and tcp_reordering
From: Rick Jones @ 2010-11-30 17:56 UTC
  To: Simon Horman; +Cc: netdev

Simon Horman wrote:
> Hi,
> 
> I just wanted to share what is a rather pleasing,
> though to me somewhat surprising result.
> 
> I am testing bonding using balance-rr mode with three physical links to try
> to get > gigabit speed for a single stream. Why?  Because I'd like to run
> various tests at > gigabit speed and I don't have any 10G hardware at my
> disposal.
> 
> The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
> LSO and GSO disabled on both the sender and receiver I see:
> 
> # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472

Why 1472 bytes per send?  If you wanted a 1:1 correspondence between the send size
and the MSS, I would guess that 1448 would have been in order (1500 bytes less 20 of
IPv4 header, 20 of TCP header and 12 of TCP timestamp option).  1472 would be the
maximum data payload for a UDP/IPv4 datagram; TCP will have more header than UDP.

> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>   87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982  -1.000
> 
> But with GRO enabled on the receiver I see.
> 
> # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384   1472    10.01      2613.83   19.32    -1.00    1.211   -1.000

If you are changing things on the receiver, you should probably enable remote 
CPU utilization measurement with the -C option.
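
For example, combined with the 1448 byte send size suggested above:

netperf -c -C -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1448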

> Which is much better than any result I get tweaking tcp_reordering when
> GRO is disabled on the receiver.
> 
> Tweaking tcp_reordering when GRO is enabled on the receiver seems to have
> negligible effect.  Which is interesting, because my brief reading on the
> subject indicated that tcp_reordering was the key tuning parameter for
> bonding with balance-rr.

You are in a maze of twisty heuristics and algorithms, all interacting :)  If
there are only three links in the bond, I suspect the chances of spurious fast
retransmission are somewhat smaller than if you had, say, four, based on just the
hand-waving observation that three duplicate ACKs require receipt of perhaps four
out-of-order segments.

> The only other parameter that seemed to have significant effect was to
> increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> impact on throughput, though a significant positive effect on CPU
> utilisation.
> 
> MTU=9000, sender,receiver:tcp_reordering=3(default), receiver:GRO=off
> netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 9872

9872?

> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384   9872    10.01      2957.52   14.89    -1.00    0.825   -1.000
> 
> MTU=9000, sender,receiver:tcp_reordering=3(default), receiver:GRO=on
> netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 9872
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384   9872    10.01      2847.64   10.84    -1.00    0.624   -1.000

Short of packet traces, taking snapshots of netstat statistics before and after 
each netperf run might be goodness - you can look at things like ratio of ACKs 
to data segments/bytes and such.  LRO/GRO can have a non-trivial effect on the 
number of ACKs, and ACKs are what matter for fast retransmit.

netstat -s > before
netperf ...
netstat -s > after
beforeafter before after > delta

where beforeafter comes from ftp://ftp.cup.hp.com/dist/networking/tools/ (for now;
the site will have to go away before long, as the campus on which it is located
has been sold) and will subtract before from after.
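
If the site has already gone away, a rough awk stand-in (it assumes the lines of the
two snapshots still line up, which is normally true across a ten second netperf run,
and it flattens the indentation) is:

awk 'NR==FNR { for (i = 1; i <= NF; i++) b[FNR "," i] = $i; next }
     { line = ""
       for (i = 1; i <= NF; i++) {
           v = $i
           if (v ~ /^[0-9]+$/ && b[FNR "," i] ~ /^[0-9]+$/)
               v = v - b[FNR "," i]
           line = line (i > 1 ? " " : "") v
       }
       print line }' before after > delta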

happy benchmarking,

rick jones


* Re: Bonding, GRO and tcp_reordering
From: Eric Dumazet @ 2010-11-30 18:14 UTC
  To: Rick Jones; +Cc: Simon Horman, netdev

On Tuesday 30 November 2010 at 09:56 -0800, Rick Jones wrote:

> Short of packet traces, taking snapshots of netstat statistics before and after 
> each netperf run might be goodness - you can look at things like ratio of ACKs 
> to data segments/bytes and such.  LRO/GRO can have a non-trivial effect on the 
> number of ACKs, and ACKs are what matter for fast retransmit.
> 
> netstat -s > before
> netperf ...
> netstat -s > after
> beforeafter before after > delta
> 
> where beforeafter comes (for now, the site will have to go away before long as 
> the campus on which it is located has been sold) 
> ftp://ftp.cup.hp.com/dist/networking/tools/  and will subtract before from after.
> 
> happy benchmarking,

Yes indeed. With a fast enough medium (or small MTUs), we can run into a
backlog-processing problem (filling huge receive queues), as seen on
loopback lately...

netstat -s can show the receive queue overruns in this case:

    TCPBacklogDrop: xxx
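
One crude way to pull just that counter out of /proc before and after a run
(its column position varies between kernels):

awk '/^TcpExt:/ && !h { for (i = 1; i <= NF; i++) n[i] = $i; h = 1; next }
     /^TcpExt:/       { for (i = 1; i <= NF; i++)
                            if (n[i] == "TCPBacklogDrop") print n[i], $i }' /proc/net/netstat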





* Re: Bonding, GRO and tcp_reordering
From: Simon Horman @ 2010-12-01  4:30 UTC
  To: Rick Jones; +Cc: netdev

On Tue, Nov 30, 2010 at 09:56:02AM -0800, Rick Jones wrote:
> Simon Horman wrote:
> >Hi,
> >
> >I just wanted to share what is a rather pleasing,
> >though to me somewhat surprising result.
> >
> >I am testing bonding using balance-rr mode with three physical links to try
> >to get > gigabit speed for a single stream. Why?  Because I'd like to run
> >various tests at > gigabit speed and I don't have any 10G hardware at my
> >disposal.
> >
> >The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
> >LSO and GSO disabled on both the sender and receiver I see:
> >
> ># netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> 
> Why 1472 bytes per send?  If you wanted a 1-1 between the send size
> and the MSS, I would guess that 1448 would have been in order.  1472
> would be the maximum data payload for a UDP/IPv4 datagram.  TCP will
> have more header than UDP.

Only to be consistent with UDP testing that I was doing at the same time.
I'll re-test with 1448.

> 
> >TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> >(172.17.60.216) port 0 AF_INET
> >Recv   Send    Send                          Utilization       Service Demand
> >Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> >Size   Size    Size     Time     Throughput  local    remote   local   remote
> >bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> >
> >  87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982  -1.000
> >
> >But with GRO enabled on the receiver I see.
> >
> ># netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> >TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> >(172.17.60.216) port 0 AF_INET
> >Recv   Send    Send                          Utilization       Service Demand
> >Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> >Size   Size    Size     Time     Throughput  local    remote   local   remote
> >bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> >
> > 87380  16384   1472    10.01      2613.83   19.32    -1.00    1.211   -1.000
> 
> If you are changing things on the receiver, you should probably
> enable remote CPU utilization measurement with the -C option.

Thanks, I will do so.

> >Which is much better than any result I get tweaking tcp_reordering when
> >GRO is disabled on the receiver.
> >
> >Tweaking tcp_reordering when GRO is enabled on the receiver seems to have
> >negligible effect.  Which is interesting, because my brief reading on the
> >subject indicated that tcp_reordering was the key tuning parameter for
> >bonding with balance-rr.
> 
> You are in a maze of twisty heuristics and algorithms, all
> interacting :)  If there are only three links in the bond, I suspect
> the chances for spurrious fast retransmission are somewhat smaller
> than if you had say four, based on just hand-waving on three
> duplicate ACKs requires receipt of perhaps four out of order
> segments.

Unfortunately NIC/slot availability only stretches to three links :-(
If you think it's really worthwhile I can obtain some more dual-port NICs.

> >The only other parameter that seemed to have significant effect was to
> >increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> >impact on throughput, though a significant positive effect on CPU
> >utilisation.
> >
> >MTU=9000, sender,receiver:tcp_reordering=3(default), receiver:GRO=off
> >netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 9872
> 
> 9872?

It should have been 8972; I'll retest with 8948, as per your suggestion above.

> >Recv   Send    Send                          Utilization       Service Demand
> >Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> >Size   Size    Size     Time     Throughput  local    remote   local   remote
> >bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> >
> > 87380  16384   9872    10.01      2957.52   14.89    -1.00    0.825   -1.000
> >
> >MTU=9000, sender,receiver:tcp_reordering=3(default), receiver:GRO=on
> >netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 9872
> >Recv   Send    Send                          Utilization       Service Demand
> >Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> >Size   Size    Size     Time     Throughput  local    remote   local   remote
> >bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> >
> > 87380  16384   9872    10.01      2847.64   10.84    -1.00    0.624   -1.000
> 
> Short of packet traces, taking snapshots of netstat statistics
> before and after each netperf run might be goodness - you can look
> at things like ratio of ACKs to data segments/bytes and such.
> LRO/GRO can have a non-trivial effect on the number of ACKs, and
> ACKs are what matter for fast retransmit.
> 
> netstat -s > before
> netperf ...
> netstat -s > after
> beforeafter before after > delta
> 
> where beforeafter comes (for now, the site will have to go away
> before long as the campus on which it is located has been sold)
> ftp://ftp.cup.hp.com/dist/networking/tools/  and will subtract
> before from after.

Thanks, I'll take a look into that.



* Re: Bonding, GRO and tcp_reordering
From: Simon Horman @ 2010-12-01  4:31 UTC
  To: Ben Hutchings; +Cc: netdev

On Tue, Nov 30, 2010 at 03:42:56PM +0000, Ben Hutchings wrote:
> On Tue, 2010-11-30 at 22:55 +0900, Simon Horman wrote:
> > Hi,
> > 
> > I just wanted to share what is a rather pleasing,
> > though to me somewhat surprising result.
> >
> > I am testing bonding using balance-rr mode with three physical links to try
> > to get > gigabit speed for a single stream. Why?  Because I'd like to run
> > various tests at > gigabit speed and I don't have any 10G hardware at my
> > disposal.
> > 
> > The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
> > LSO and GSO disabled on both the sender and receiver I see:
> > 
> > # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> > (172.17.60.216) port 0 AF_INET
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> > 
> >   87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982  -1.000
> > 
> > But with GRO enabled on the receiver I see.
> > 
> > # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> > (172.17.60.216) port 0 AF_INET
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> > 
> >  87380  16384   1472    10.01      2613.83   19.32    -1.00    1.211   -1.000
> > 
> > Which is much better than any result I get tweaking tcp_reordering when
> > GRO is disabled on the receiver.
> 
> Did you also enable TSO/GSO on the sender?

It didn't seem to make any difference either way.
I'll re-test just in case I missed something.

> 
> What TSO/GSO will do is to change the round-robin scheduling from one
> packet per interface to one super-packet per interface.  GRO then
> coalesces the physical packets back into a super-packet.  The intervals
> between receiving super-packets then tend to exceed the difference in
> delay between interfaces, hiding the reordering.
> 
> If you only enabled GRO then I don't understand this.
> 
> > Tweaking tcp_reordering when GRO is enabled on the receiver seems to have
> > negligible effect.  Which is interesting, because my brief reading on the
> > subject indicated that tcp_reordering was the key tuning parameter for
> > bonding with balance-rr.
> > 
> > The only other parameter that seemed to have significant effect was to
> > increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> > impact on throughput, though a significant positive effect on CPU
> > utilisation.
> [...]
> 
> Increasing MTU also increases the interval between packets on a TCP flow
> using maximum segment size so that it is more likely to exceed the
> difference in delay.

I hadn't considered that, thanks.



* Re: Bonding, GRO and tcp_reordering
From: Simon Horman @ 2010-12-01  4:34 UTC
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Tue, Nov 30, 2010 at 05:04:33PM +0100, Eric Dumazet wrote:
> On Tuesday 30 November 2010 at 15:42 +0000, Ben Hutchings wrote:
> > On Tue, 2010-11-30 at 22:55 +0900, Simon Horman wrote:
> 
> > > The only other parameter that seemed to have significant effect was to
> > > increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> > > impact on throughput, though a significant positive effect on CPU
> > > utilisation.
> > [...]
> > 
> > Increasing MTU also increases the interval between packets on a TCP flow
> > using maximum segment size so that it is more likely to exceed the
> > difference in delay.
> > 
> 
> GRO really is operational _if_ we receive in same NAPI run several
> packets for the same flow.
> 
> As soon as we exit NAPI mode, GRO packets are flushed.
> 
> Big MTU --> bigger delays between packets, so big chance that GRO cannot
> trigger at all, since NAPI runs for one packet only.
> 
> One possibility with big MTU is to tweak "ethtool -c eth0" params
> rx-usecs: 20
> rx-frames: 5
> rx-usecs-irq: 0
> rx-frames-irq: 5
> so that "rx-usecs" is bigger than the delay between two MTU full sized
> packets.
> 
> Gigabit speed means 1 nano second per bit, and MTU=9000 means 72 us
> delay between packets.
> 
> So try :
> 
> ethtool -C eth0 rx-usecs 100
> 
> to get chance that several packets are delivered at once by NIC.
> 
> Unfortunately, this also add some latency, so it helps bulk transferts,
> and slowdown interactive traffic 

Thanks Eric,

I was tweaking those values recently for some latency tuning
but I didn't think of them in relation to last night's tests.

In terms of my measurements, it's just benchmarking at this stage.
So a trade-off between throughput and latency is acceptable, so long
as I remember to measure what it is.



* Re: Bonding, GRO and tcp_reordering
From: Eric Dumazet @ 2010-12-01  4:47 UTC
  To: Simon Horman; +Cc: Ben Hutchings, netdev

On Wednesday 1 December 2010 at 13:34 +0900, Simon Horman wrote:

> I was tweaking those values recently for some latency tuning
> but I didn't think of them in relation to last night's tests.
> 
> In terms of my measurements, its just benchmarking at this stage.
> So a trade-off between throughput and latency is acceptable, so long
> as I remember to measure what it is.
> 

I was thinking again this morning about GRO and bonding, and I don't know
if it actually works...

Is GRO on for the individual eth0/eth1/eth2 devices you use, or on the
bonding device itself?
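
Something like this (with whatever device names you actually have) would show
where it is currently enabled:

for dev in bond0 eth0 eth1 eth2; do
    echo -n "$dev: "
    ethtool -k "$dev" | grep generic-receive-offload
done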





* Re: Bonding, GRO and tcp_reordering
From: Rick Jones @ 2010-12-01 19:42 UTC
  To: Simon Horman; +Cc: netdev

>>You are in a maze of twisty heuristics and algorithms, all
>>interacting :)  If there are only three links in the bond, I suspect
>>the chances for spurrious fast retransmission are somewhat smaller
>>than if you had say four, based on just hand-waving on three
>>duplicate ACKs requires receipt of perhaps four out of order
>>segments.
> 
> 
> Unfortunately NIC/slot availability only stretches to three links :-(
> If you think its really worthwhile I can obtain some more dual-port nics.

Only if you want to increase the chances of reordering that triggers spurious
fast retransmits.

rick jones


* Re: Bonding, GRO and tcp_reordering
From: Simon Horman @ 2010-12-02  6:39 UTC
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Wed, Dec 01, 2010 at 05:47:06AM +0100, Eric Dumazet wrote:
> On Wednesday 1 December 2010 at 13:34 +0900, Simon Horman wrote:
> 
> > I was tweaking those values recently for some latency tuning
> > but I didn't think of them in relation to last night's tests.
> > 
> > In terms of my measurements, its just benchmarking at this stage.
> > So a trade-off between throughput and latency is acceptable, so long
> > as I remember to measure what it is.
> > 
> 
> I was thinking again this morning about GRO and bonding, and dont know
> if it actually works...
> 
> Is GRO on on individual eth0/eth1/eth2 you use, or on bonding device
> itself ?

All of the above. I can check different combinations if it helps.



* Re: Bonding, GRO and tcp_reordering
From: Simon Horman @ 2010-12-03 13:38 UTC
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Wed, Dec 01, 2010 at 01:34:45PM +0900, Simon Horman wrote:
> On Tue, Nov 30, 2010 at 05:04:33PM +0100, Eric Dumazet wrote:
> > On Tuesday 30 November 2010 at 15:42 +0000, Ben Hutchings wrote:
> > > On Tue, 2010-11-30 at 22:55 +0900, Simon Horman wrote:

To clarify my statement in a previous email that GSO had no effect: I
re-ran the tests and I still haven't observed any effect of GSO on my
results. However, I did notice that in order for GRO on the server to have
an effect I also need TSO enabled on the client.  I thought that I had
previously checked that but I was mistaken.

Enabling TSO on the client while leaving GSO disabled on the server
resulted in increased CPU utilisation on the client, from ~15% to ~20%.

> > > > The only other parameter that seemed to have significant effect was to
> > > > increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> > > > impact on throughput, though a significant positive effect on CPU
> > > > utilisation.
> > > [...]
> > > 
> > > Increasing MTU also increases the interval between packets on a TCP flow
> > > using maximum segment size so that it is more likely to exceed the
> > > difference in delay.
> > > 
> > 
> > GRO really is operational _if_ we receive in same NAPI run several
> > packets for the same flow.
> > 
> > As soon as we exit NAPI mode, GRO packets are flushed.
> > 
> > Big MTU --> bigger delays between packets, so big chance that GRO cannot
> > trigger at all, since NAPI runs for one packet only.
> > 
> > One possibility with big MTU is to tweak "ethtool -c eth0" params
> > rx-usecs: 20
> > rx-frames: 5
> > rx-usecs-irq: 0
> > rx-frames-irq: 5
> > so that "rx-usecs" is bigger than the delay between two MTU full sized
> > packets.
> > 
> > Gigabit speed means 1 nano second per bit, and MTU=9000 means 72 us
> > delay between packets.
> > 
> > So try :
> > 
> > ethtool -C eth0 rx-usecs 100
> > 
> > to get chance that several packets are delivered at once by NIC.
> > 
> > Unfortunately, this also add some latency, so it helps bulk transferts,
> > and slowdown interactive traffic 
> 
> Thanks Eric,
> 
> I was tweaking those values recently for some latency tuning
> but I didn't think of them in relation to last night's tests.
> 
> In terms of my measurements, its just benchmarking at this stage.
> So a trade-off between throughput and latency is acceptable, so long
> as I remember to measure what it is.

Thanks, rx-usecs was set to 3 and changing it to 15 on the server
did seem to increase throughput with 1500 byte packets, although
CPU utilisation increased too, disproportionately so on the client.

MTU=1500, client,server:tcp_reordering=3, client:GSO=off,
	client:TSO=on, server:GRO=off, server:rx-usecs=3(default)
# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1591.34   16.35    5.80     1.683   2.390

MTU=1500, client,server:tcp_reordering=3(default), client:GSO=off,
	client:TSO=on, server:GRO=off server:rx-usecs=15
# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1774.38   23.75    7.58     2.193   2.801

I also saw an improvement with GRO enabled on the server and TSO enabled on
the client, although in this case I found rx-usecs=45 to be the best
value.

MTU=1500, client,server:tcp_reordering=3(default), client:GSO=off,
	client:TSO=on, server:GRO=on server:rx-usecs=3(default)
# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      2553.27   13.31    3.35     0.854   0.860

MTU=1500, client,server:tcp_reordering=3(default), client:GSO=off,
	client:TSO=on, server:GRO=on server:rx-usecs=45
# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      2727.53   29.45    9.48     1.769   2.278

I did not observe any improvement in throughput when increasing rx-usecs
from 3 when using MTU=9000, although there was a slight increase in CPU
utilisation (maybe; there is quite a lot of noise in the results).
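
For anyone repeating this, the sweep boils down to something like the following,
run per slave on the receiver ("client" and the device name are just placeholders
for however you reach the sending host and whichever NIC you are tuning):

for usecs in 3 15 45 100; do
    ethtool -C eth0 rx-usecs "$usecs"
    ssh client "netperf -c -C -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1448"
done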

