* [BUG] ixgbe: something wrong with queue selection ?
@ 2012-04-17  9:06 Eric Dumazet
  2012-04-17  9:16 ` Jeff Kirsher
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-17  9:06 UTC (permalink / raw)
  To: Kirsher, Jeffrey T, Greg Rose, John Fastabend, Jesse Brandeburg; +Cc: netdev

Hi guys

I have a bad feeling about ixgbe and its multiqueue selection.

On a quad core machine (Q6600), I get lots of reorderings on a single
TCP stream.


Apparently packets are happily spread across all available queues,
instead of a single one.

This defeats GRO at the receiver, and TCP performance is really bad.

# ethtool -S eth5|egrep "x_queue_[0123]_packets" ; taskset 1 netperf -H
192.168.99.1 ; ethtool -S eth5|egrep "x_queue_[0123]_packets"
     tx_queue_0_packets: 24
     tx_queue_1_packets: 26
     tx_queue_2_packets: 32
     tx_queue_3_packets: 16
     rx_queue_0_packets: 11
     rx_queue_1_packets: 47
     rx_queue_2_packets: 27
     rx_queue_3_packets: 22
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.99.1 (192.168.99.1) port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00    3866.43   
     tx_queue_0_packets: 1653201
     tx_queue_1_packets: 608000
     tx_queue_2_packets: 541382
     tx_queue_3_packets: 536543
     rx_queue_0_packets: 434703
     rx_queue_1_packets: 137444
     rx_queue_2_packets: 131023
     rx_queue_3_packets: 128407

# ip ro get 192.168.99.1
192.168.99.1 dev eth5  src 192.168.99.2 
    cache  ipid 0x438b rtt 4ms rttvar 4ms cwnd 57 reordering 127

# lspci -v -s 02:00.0
02:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
	Subsystem: Intel Corporation Ethernet Server Adapter X520-2
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at f1100000 (64-bit, prefetchable) [size=512K]
	I/O ports at b000 [size=32]
	Memory at f1200000 (64-bit, prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-4a-fe-54
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
	Kernel driver in use: ixgbe
	Kernel modules: ixgbe


* Re: [BUG] ixgbe: something wrong with queue selection ?
  2012-04-17  9:06 [BUG] ixgbe: something wrong with queue selection ? Eric Dumazet
@ 2012-04-17  9:16 ` Jeff Kirsher
  2012-04-17 16:01   ` Alexander Duyck
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff Kirsher @ 2012-04-17  9:16 UTC (permalink / raw)
  To: Eric Dumazet, Skidmore, Donald C, Duyck, Alexander H
  Cc: Greg Rose, John Fastabend, Jesse Brandeburg, netdev

On Tue, 2012-04-17 at 11:06 +0200, Eric Dumazet wrote:
> Hi guys
> 
> I have a bad feeling about ixgbe and its multiqueue selection.
> 
> On a quad core machine (Q6600), I get lots of reorderings on a single
> TCP stream.
> 
> 
> Apparently packets are happily spread across all available queues,
> instead of a single one.
> 
> This defeats GRO at the receiver, and TCP performance is really bad.
> 
> [...]

Adding Don Skidmore and Alex Duyck...



* Re: [BUG] ixgbe: something wrong with queue selection ?
  2012-04-17  9:16 ` Jeff Kirsher
@ 2012-04-17 16:01   ` Alexander Duyck
  2012-04-17 16:38     ` John Fastabend
  2012-04-17 16:46     ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Alexander Duyck @ 2012-04-17 16:01 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: Eric Dumazet, Skidmore, Donald C, Greg Rose, John Fastabend,
	Jesse Brandeburg, netdev

On 04/17/2012 02:16 AM, Jeff Kirsher wrote:
> On Tue, 2012-04-17 at 11:06 +0200, Eric Dumazet wrote:
>> Hi guys
>>
>> I have a bad feeling about ixgbe and its multiqueue selection.
>>
>> On a quad core machine (Q6600), I get lots of reorderings on a single
>> TCP stream.
>>
>>
>> Apparently packets are happily spread across all available queues,
>> instead of a single one.
>>
>> This defeats GRO at the receiver, and TCP performance is really bad.
>>
>> [...]
> Adding Don Skidmore and Alex Duyck...
This is probably the result of ATR and the load balancer on the system. 
What is likely happening is that the netperf process is getting moved
from CPU to CPU, and this is causing the transmit queue to change.  Once
this happens the ATR will cause the receive queue to change in order to
follow the transmitting process.
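
For illustration, a minimal sketch of the kind of CPU-derived queue
selection described above (a hedged approximation, not the literal
ixgbe code; the function name is made up):

	/* Sketch: derive the TX queue from the CPU the sender happens
	 * to run on.  When the scheduler migrates the task the queue
	 * changes, and ATR then re-steers the flow's RX queue to
	 * follow it -- hence the reordering.
	 */
	static u16 cpu_based_select_queue(struct net_device *dev,
					  struct sk_buff *skb)
	{
		u16 txq = smp_processor_id();	/* moves with the task */

		while (unlikely(txq >= dev->real_num_tx_queues))
			txq -= dev->real_num_tx_queues;
		return txq;
	}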

One thing you might try is using the "-T" option in netperf to see if
the behaviour still occurs when the process is bound to a specific CPU.
Another thing you might try would be to disable ATR by enabling ntuple.
You should be able to do that with "ethtool -K eth5 ntuple on".

Thanks,

Alex


* Re: [BUG] ixgbe: something wrong with queue selection ?
  2012-04-17 16:01   ` Alexander Duyck
@ 2012-04-17 16:38     ` John Fastabend
  2012-04-17 17:07       ` Ben Hutchings
  2012-04-17 16:46     ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: John Fastabend @ 2012-04-17 16:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: jeffrey.t.kirsher, Eric Dumazet, Skidmore, Donald C, Greg Rose,
	Jesse Brandeburg, netdev

On 4/17/2012 9:01 AM, Alexander Duyck wrote:
> On 04/17/2012 02:16 AM, Jeff Kirsher wrote:
>> On Tue, 2012-04-17 at 11:06 +0200, Eric Dumazet wrote:
>>> Hi guys
>>>
>>> I have a bad feeling about ixgbe and its multiqueue selection.
>>>
>>> On a quad core machine (Q6600), I get lots of reorderings on a single
>>> TCP stream.
>>>
>>>
>>> Apparently packets are happily spread across all available queues,
>>> instead of a single one.
>>>
>>> This defeats GRO at the receiver, and TCP performance is really bad.
>>>
>>> [...]
>> Adding Don Skidmore and Alex Duyck...
> This is probably the result of ATR and the load balancer on the system. 
> What is likely happening is that the netperf process is getting moved
> from CPU to CPU, and this is causing the transmit queue to change.  Once
> this happens the ATR will cause the receive queue to change in order to
> follow the transmitting process.
> 
> One thing you might try is using the "-T" option in netperf to see if
> the behaviour still occurs when the process is bound to a specific CPU.
> Another thing you might try would be to disable ATR by enabling ntuple.
> You should be able to do that with "ethtool -K eth5 ntuple on".
> 
> Thanks,
> 
> Alex
> --

We could consider using sk_tx_queue_get(sk) in select_queue, or better
yet use XPS and initialize it at sw init time. I think this would work
nicely and finally remove the select_queue() logic; I just haven't got
to it yet. Hoping to get there in the next few weeks.

.John
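
A rough sketch of the direction John describes above, using
sk_tx_queue_get() to keep a flow on a single TX queue (hedged: the
function name and the hash fallback are assumptions, not John's actual
code):

	/* Sketch: reuse the TX queue recorded on the socket so a flow
	 * sticks to one ring; otherwise fall back to the stack's hash.
	 */
	static u16 socket_based_select_queue(struct net_device *dev,
					     struct sk_buff *skb)
	{
		struct sock *sk = skb->sk;
		int txq = sk ? sk_tx_queue_get(sk) : -1;

		if (txq >= 0 && txq < dev->real_num_tx_queues)
			return (u16)txq;
		return skb_tx_hash(dev, skb);
	}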


* Re: [BUG] ixgbe: something wrong with queue selection ?
  2012-04-17 16:01   ` Alexander Duyck
  2012-04-17 16:38     ` John Fastabend
@ 2012-04-17 16:46     ` Eric Dumazet
  2012-04-17 21:38       ` TSO not 10G friendly if peer is close enough Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-17 16:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: jeffrey.t.kirsher, Skidmore, Donald C, Greg Rose, John Fastabend,
	Jesse Brandeburg, netdev

On Tue, 2012-04-17 at 09:01 -0700, Alexander Duyck wrote:
> On 04/17/2012 02:16 AM, Jeff Kirsher wrote:
> > On Tue, 2012-04-17 at 11:06 +0200, Eric Dumazet wrote:
> >> Hi guys
> >>
> >> I have a bad feeling about ixgbe and its multiqueue selection.
> >>
> >> On a quad core machine (Q6600), I get lots of reorderings on a single
> >> TCP stream.
> >>
> >>
> >> Apparently packets are happily spread across all available queues,
> >> instead of a single one.
> >>
> >> This defeats GRO at the receiver, and TCP performance is really bad.
> >>
> >> [...]
> > Adding Don Skidmore and Alex Duyck...
> This is probably the result of ATR and the load balancer on the system. 
> What is likely happening is that the netperf process is getting moved
> from CPU to CPU, and this is causing the transmit queue to change.  Once
> this happens the ATR will cause the receive queue to change in order to
> follow the transmitting process.
> 
> One thing you might try is using the "-T" option in netperf to see if
> the behaviour still occurs when the process is bound to a specific CPU.
> Another thing you might try would be to disable ATR by enabling ntuple.
> You should be able to do that with "ethtool -K eth5 ntuple on".

I used taskset 1 netperf to force netperf to run on cpu0.

The problem is that incoming ACKs seem to be spread around, so the TCP
stack might run on all cpus.


* Re: [BUG] ixgbe: something wrong with queue selection ?
  2012-04-17 16:38     ` John Fastabend
@ 2012-04-17 17:07       ` Ben Hutchings
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Hutchings @ 2012-04-17 17:07 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Duyck, jeffrey.t.kirsher, Eric Dumazet, Skidmore,
	Donald C, Greg Rose, Jesse Brandeburg, netdev

On Tue, 2012-04-17 at 09:38 -0700, John Fastabend wrote:
[...]
> We could consider using sk_tx_queue_get(sk) in select_queue, or better
> yet use XPS and initialize it at sw init time. I think this would work
> nicely and finally remove the select_queue() logic; I just haven't got
> to it yet. Hoping to get there in the next few weeks.

Anything like <http://thread.gmane.org/gmane.linux.network/186700/>?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* TSO not 10G friendly if peer is close enough
  2012-04-17 16:46     ` Eric Dumazet
@ 2012-04-17 21:38       ` Eric Dumazet
  2012-04-17 21:47         ` David Miller
  2012-04-18 15:49         ` [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2012-04-17 21:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: jeffrey.t.kirsher, Skidmore, Donald C, Greg Rose, John Fastabend,
	Jesse Brandeburg, netdev

After further analysis, I found we hit page refcount games badly,
because when we transmit a full size skb (64 KB), we can receive an ACK
for the first MSS of the frame while the skb has not been completely
sent by the NIC.

(It takes ~52 us to send a full TSO frame at 10Gb: 64 KB is 524,288
bits, and 524,288 bits / 10^10 bits/s ~= 52 us. And maybe the NIC
delays the interrupt that triggers TX completion?)

In this case, tcp_trim_head() has to call pskb_expand_head(), because
the skb clone is still alive in the TX ring buffer.

pskb_expand_head() is really expensive: it has to make about 32 atomic
operations on page refcounts.

Hmm... maybe tcp_trim_head() should not trim but only update an offset
in the skb... With some luck, the offset can reach skb->len when all
data is ACKnowledged...

Only in case of retransmit would we need to really trim the skb, and by
that time the clone would have been freed too: no more
pskb_expand_head() calls.
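
The idea in miniature, as a hedged sketch (offset_ack is the field the
patch later in this thread introduces, and the helper name here is made
up):

	/* Sketch: on a partial ACK, remember the acked prefix instead
	 * of trimming it, so no pskb_expand_head() is needed while a
	 * clone sits in the TX ring.  Only a retransmit really trims.
	 */
	static void note_acked_prefix(struct tcp_sock *tp, struct sk_buff *skb)
	{
		TCP_SKB_CB(skb)->header.offset_ack =
			tp->snd_una - TCP_SKB_CB(skb)->seq;
	}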


* Re: TSO not 10G friendly if peer is close enough
  2012-04-17 21:38       ` TSO not 10G friendly if peer is close enough Eric Dumazet
@ 2012-04-17 21:47         ` David Miller
  2012-04-18  3:00           ` Eric Dumazet
  2012-04-18 15:49         ` [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: David Miller @ 2012-04-17 21:47 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.h.duyck, jeffrey.t.kirsher, donald.c.skidmore,
	gregory.v.rose, john.r.fastabend, jesse.brandeburg, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 Apr 2012 23:38:42 +0200

> Hmm... maybe tcp_trim_head() should not trim but only update an offset
> in the skb... With some luck, the offset can reach skb->len when all
> data is ACKnowledged...

This is definitely the way to fix this.  Just essentially defer all
the page operations until later when the entire SKB is consumed.


* Re: TSO not 10G friendly if peer is close enough
  2012-04-17 21:47         ` David Miller
@ 2012-04-18  3:00           ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2012-04-18  3:00 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.h.duyck, jeffrey.t.kirsher, donald.c.skidmore,
	gregory.v.rose, john.r.fastabend, jesse.brandeburg, netdev

On Tue, 2012-04-17 at 17:47 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 Apr 2012 23:38:42 +0200
> 
> > Hmm... maybe tcp_trim_head() should not trim but only update an offset
> > in the skb... With some luck, the offset can reach skb->len when all
> > data is ACKnowledged...
> 
> This is definitely the way to fix this.  Just essentially defer all
> the page operations until later when the entire SKB is consumed.

Yes, I'll implement this today.

Thanks 


* [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-17 21:38       ` TSO not 10G friendly if peer is close enough Eric Dumazet
  2012-04-17 21:47         ` David Miller
@ 2012-04-18 15:49         ` Eric Dumazet
       [not found]           ` <4F8EF317.10504@hp.com>
                             ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Eric Dumazet @ 2012-04-18 15:49 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Tom Herbert, Neal Cardwell, Maciej Żenczykowski,
	Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
unexpected profiling results, with pskb_expand_head() at the top.

After further analysis, I found we hit page refcounts badly, because
when we transmit a full size skb (64 KB), we can receive an ACK for the
first segments of the frame while the skb has not been completely sent
by the NIC.

It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
close peer, we can receive a TCP ACK in less than 50 us rtt.

This is also true on 1Gb links, but there we were limited by wire
speed, not cpu.

When we try to trim the skb, tcp_trim_head() has to call
pskb_expand_head(), because the skb clone we did for transmit is still
alive in the TX ring buffer.

pskb_expand_head() is really expensive: it has to make about 16+16
atomic operations on page refcounts, not counting the skb head
reallocation/copy. It increases the chances of false sharing.

In fact, we don't really need to trim the skb. This costly operation
can be delayed to the point it is really needed: that is when a
retransmit must happen.

Most of the time, upcoming ACKs will ack the whole packet, and we can
free it with minimal cost (since the clone was already freed by TX
completion).

Of course, this means we don't uncharge the acked part from socket
limits until retransmit, but this is hardly a concern with current
autotuning (around 4 MB per socket). Even with a small cwnd limit, a
single packet cannot hold more than half the window.

Performance results on my Q6600 cpu and 82599EB 10-Gigabit card: about
3% less cpu used for the same workload (single netperf TCP_STREAM),
bounded by x4 PCI-e slots (4660 Mbits).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     |    6 ++++--
 net/ipv4/tcp_input.c  |   24 +++++++++++++-----------
 net/ipv4/tcp_output.c |   17 +++++++++++------
 3 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5984e3..0f57706 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_retransmit_timer(struct sock *sk);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
-extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
+extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+				 unsigned int mss_now);
 extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
 
 extern void tcp_send_probe0(struct sock *);
@@ -640,7 +641,8 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
-	} header;	/* For incoming frames		*/
+		unsigned int offset_ack; /* part of acked data in this skb */
+	} header;
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	__u32		when;		/* used to compute rtt's	*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99448f0..529740c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3260,25 +3260,27 @@ static void tcp_rearm_rto(struct sock *sk)
 	}
 }
 
-/* If we get here, the whole TSO packet has not been acked. */
+/* If we get here, the whole packet has not been acked.
+ * We used to call tcp_trim_head() to remove acked data from skb,
+ * but its expensive with TSO if our previous clone is still in flight.
+ * We thus maintain an offset_ack, and hope no pskb_expand_head()
+ * is needed until whole packet is acked by upcoming ACKs.
+ */
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 packets_acked;
+	u32 prev_packets_acked;
 
 	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
 
-	packets_acked = tcp_skb_pcount(skb);
-	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
-		return 0;
-	packets_acked -= tcp_skb_pcount(skb);
+	prev_packets_acked = tcp_skb_pcount(skb);
 
-	if (packets_acked) {
-		BUG_ON(tcp_skb_pcount(skb) == 0);
-		BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-	}
+	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;
+
+	if (tcp_skb_pcount(skb) > 1)
+		tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
 
-	return packets_acked;
+	return prev_packets_acked - tcp_skb_pcount(skb);
 }
 
 /* Remove acknowledged frames from the retransmission queue. If our packet
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de8790c..426b400 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -927,11 +927,15 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
 	sk_mem_charge(sk, skb->truesize);
 }
 
-/* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-				 unsigned int mss_now)
+/* Initialize TSO segments for a packet.
+ * Part of skb (offset_ack) might have been acked.
+ */
+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+			  unsigned int mss_now)
 {
-	if (skb->len <= mss_now || !sk_can_gso(sk) ||
+	unsigned int len = skb->len - TCP_SKB_CB(skb)->header.offset_ack;
+
+	if (len <= mss_now || !sk_can_gso(sk) ||
 	    skb->ip_summed == CHECKSUM_NONE) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
@@ -940,7 +944,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
 		skb_shinfo(skb)->gso_size = 0;
 		skb_shinfo(skb)->gso_type = 0;
 	} else {
-		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len, mss_now);
 		skb_shinfo(skb)->gso_size = mss_now;
 		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
 	}
@@ -1126,7 +1130,7 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
 }
 
 /* Remove acked data from a packet in the transmit queue. */
-int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+static int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 {
 	if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
@@ -1134,6 +1138,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 	__pskb_trim_head(skb, len);
 
 	TCP_SKB_CB(skb)->seq += len;
+	TCP_SKB_CB(skb)->header.offset_ack = 0;
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
 	skb->truesize	     -= len;


* Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
       [not found]           ` <4F8EF317.10504@hp.com>
@ 2012-04-18 17:16             ` Eric Dumazet
  2012-04-18 17:30               ` Rick Jones
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-18 17:16 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Wed, 2012-04-18 at 10:00 -0700, Rick Jones wrote:

> Is the issue completely sent, or transmit completion processed?  I'd
> think it is the time to the latter that matters (and includes the
> former), yes?
> 

I don't know. The fact is we process ACKs before the clone skb is freed
by TX completion.

> Does the ixgbe driver do transmit completions first when it gets a 
> receive interrupt, or is there still the chance that the receipt of the 
> last ACK for the 64KB skb will hit TCP before the driver has done the 
> free?  (Or does that not matter?)

It does transmit completions first, but that doesn't matter, since we
receive the ACK before the skb could be drained by the NIC and returned
to the driver for TX completion.

> 
> > Performance results on my Q6600 cpu and 82599EB 10-Gigabit card :
> > About 3% less cpu used for same workload (single netperf TCP_STREAM),
> > bounded by x4 PCI-e slots (4660 Mbits).
> 
> Three percent less or three percentage points less?  Including the 
> details of the netperf-reported service demand would make that clear.

netperf results are not precise enough, since my setup is limited by PCI
bandwidth. Here are the "perf stat" ones:

Maybe someone can run the test on 20Gb/40Gb links and a NUMA machine.

Before patch :

# perf stat -r 5 -d -d -o RES.before taskset 1 netperf -H 192.168.99.1 -l 20

 Performance counter stats for 'taskset 1 netperf -H 192.168.99.1 -l 20' (5 runs):

       6252,882411 task-clock                #    0,312 CPUs utilized            ( +-  0,51% )
             5 988 context-switches          #    0,958 K/sec                    ( +-  0,34% )
                 2 CPU-migrations            #    0,000 K/sec                    ( +- 15,31% )
               389 page-faults               #    0,062 K/sec                  
     9 938 280 877 cycles                    #    1,589 GHz                      ( +-  0,55% ) [21,19%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
    11 709 374 305 instructions              #    1,18  insns per cycle          ( +-  0,28% ) [21,32%]
     1 026 659 544 branches                  #  164,190 M/sec                    ( +-  0,40% ) [21,49%]
        10 898 375 branch-misses             #    1,06% of all branches          ( +-  1,87% ) [21,54%]
     5 238 382 991 L1-dcache-loads           #  837,755 M/sec                    ( +-  0,21% ) [14,26%]
     1 117 076 847 L1-dcache-load-misses     #   21,32% of all L1-dcache hits    ( +-  0,49% ) [14,19%]
       166 208 073 LLC-loads                 #   26,581 M/sec                    ( +-  0,88% ) [14,33%]
         3 220 627 LLC-load-misses           #    1,94% of all LL-cache hits     ( +-  2,39% ) [14,31%]
     9 470 544 759 L1-icache-loads           # 1514,589 M/sec                    ( +-  0,44% ) [14,41%]
        23 602 610 L1-icache-load-misses     #    0,25% of all L1-icache hits    ( +-  3,10% ) [14,49%]
     5 241 137 739 dTLB-loads                #  838,195 M/sec                    ( +-  0,18% ) [14,20%]
         4 970 360 dTLB-load-misses          #    0,09% of all dTLB cache hits   ( +-  1,01% ) [14,47%]
    11 720 311 101 iTLB-loads                # 1874,385 M/sec                    ( +-  0,34% ) [21,33%]
           587 825 iTLB-load-misses          #    0,01% of all iTLB cache hits   ( +- 31,06% ) [21,52%]

      20,018804246 seconds time elapsed                                          ( +-  0,00% )


After patch :

# perf stat -r 5 -d -d -o RES.after taskset 1 netperf -H 192.168.99.1 -l 20

 Performance counter stats for 'taskset 1 netperf -H 192.168.99.1 -l 20' (5 runs):

       6061,208375 task-clock                #    0,303 CPUs utilized            ( +-  0,18% )
             6 032 context-switches          #    0,995 K/sec                    ( +-  0,22% )
                 2 CPU-migrations            #    0,000 K/sec                    ( +- 52,44% )
               390 page-faults               #    0,064 K/sec                    ( +-  0,05% )
     9 623 179 185 cycles                    #    1,588 GHz                      ( +-  0,16% ) [21,33%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
    11 724 650 132 instructions              #    1,22  insns per cycle          ( +-  0,22% ) [21,52%]
     1 025 017 197 branches                  #  169,111 M/sec                    ( +-  0,29% ) [21,75%]
        10 464 785 branch-misses             #    1,02% of all branches          ( +-  1,78% ) [21,82%]
     5 230 299 185 L1-dcache-loads           #  862,914 M/sec                    ( +-  0,20% ) [14,55%]
     1 109 236 741 L1-dcache-load-misses     #   21,21% of all L1-dcache hits    ( +-  0,59% ) [14,59%]
       161 721 826 LLC-loads                 #   26,681 M/sec                    ( +-  0,58% ) [14,25%]
         2 974 990 LLC-load-misses           #    1,84% of all LL-cache hits     ( +-  0,95% ) [14,13%]
     9 233 690 637 L1-icache-loads           # 1523,408 M/sec                    ( +-  0,24% ) [14,14%]
        17 177 769 L1-icache-load-misses     #    0,19% of all L1-icache hits    ( +-  0,69% ) [14,05%]
     5 218 114 832 dTLB-loads                #  860,903 M/sec                    ( +-  0,12% ) [14,23%]
         4 980 060 dTLB-load-misses          #    0,10% of all dTLB cache hits   ( +-  1,23% ) [14,33%]
    11 743 563 935 iTLB-loads                # 1937,495 M/sec                    ( +-  0,13% ) [21,38%]
           959 598 iTLB-load-misses          #    0,01% of all iTLB cache hits   ( +- 24,72% ) [21,33%]

      20,019067285 seconds time elapsed                                          ( +-  0,00% )


* Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 17:16             ` Eric Dumazet
@ 2012-04-18 17:30               ` Rick Jones
  2012-04-18 17:40                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Rick Jones @ 2012-04-18 17:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 04/18/2012 10:16 AM, Eric Dumazet wrote:
> On Wed, 2012-04-18 at 10:00 -0700, Rick Jones wrote:
>
>> Is the issue completely sent, or transmit completion processed?  I'd
>> think it is the time to the latter that matters (and includes the
>> former), yes?
>>
>
> I don't know. The fact is we process ACKs before the clone skb is freed
> by TX completion.
>
>> Does the ixgbe driver do transmit completions first when it gets a
>> receive interrupt, or is there still the chance that the receipt of the
>> last ACK for the 64KB skb will hit TCP before the driver has done the
>> free?  (Or does that not matter?)
>
> It does transmit completions first, but that doesn't matter, since we
> receive the ACK before the skb could be drained by the NIC and returned
> to the driver for TX completion.

I was thinking more about the race, if any, between the ACK for the last
byte of the 64 KB skb and the transmit completion processing that frees
it in the driver.  But that may be moot.


>>> Performance results on my Q6600 cpu and 82599EB 10-Gigabit card :
>>> About 3% less cpu used for same workload (single netperf TCP_STREAM),
>>> bounded by x4 PCI-e slots (4660 Mbits).
>>
>> Three percent less or three percentage points less?  Including the
>> details of the netperf-reported service demand would make that clear.
>
> netperf results are not precise enough, since my setup is limited by PCI
> bandwidth. Here are the "perf stat" ones:

I'm confused: netperf's CPU utilization measurements (-c -C), and by
extension its service demand calculation, should be able to see an
overall three percentage point change in CPU util, even a three percent
one.

>
> Maybe someone can run the test on 20Gb/40Gb links and a NUMA machine.
>
> Before patch :
>
> # perf stat -r 5 -d -d -o RES.before taskset 1 netperf -H 192.168.99.1 -l 20

I'm still learning about perf, and the manpage I have for it does not
discuss the -d option, but is that done system-wide, or only in the
context of the netperf process?

rick


* Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 17:30               ` Rick Jones
@ 2012-04-18 17:40                 ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2012-04-18 17:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Wed, 2012-04-18 at 10:30 -0700, Rick Jones wrote:

> I was thinking more about the race, if any, between the ACK for the last
> byte of the 64 KB skb and the transmit completion processing that frees
> it in the driver.  But that may be moot.
> 

No race; it's done with atomic operations.


> >
> > # perf stat -r 5 -d -d -o RES.before taskset 1 netperf -H 192.168.99.1 -l 20
> 
> I'm still learning about perf, and the manpage I have for it does not
> discuss the -d option, but is that done system-wide, or only in the
> context of the netperf process?

In fact I used the -a option and forgot to copy it into the mail I sent:

perf stat -h

 usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
        --filter <filter>
                          event filter
    -i, --no-inherit      child tasks do not inherit counters
    -p, --pid <pid>       stat events on existing process id
    -t, --tid <tid>       stat events on existing thread id
    -a, --all-cpus        system-wide collection from all CPUs
    -g, --group           put the counters into a counter group
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100)
    -n, --null            null run - dont start any counters
    -d, --detailed        detailed run - start a lot of events
    -S, --sync            call sync() before starting a run
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -A, --no-aggr         disable CPU count aggregation
    -x, --field-separator <separator>
                          print counts with custom separator
    -G, --cgroup <name>   monitor event in cgroup name only
    -o, --output <file>   output file name
        --append          append to the output file
        --log-fd <n>      log output to fd, instead of stderr


* Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 15:49         ` [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls Eric Dumazet
       [not found]           ` <4F8EF317.10504@hp.com>
@ 2012-04-18 18:40           ` Neal Cardwell
  2012-04-18 19:18             ` Eric Dumazet
  2012-04-18 19:41           ` [PATCH " Vijay Subramanian
  2 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2012-04-18 18:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tom Herbert, Maciej Żenczykowski,
	Yuchung Cheng

On Wed, Apr 18, 2012 at 11:49 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> [...]

Nice find!

> +       prev_packets_acked = tcp_skb_pcount(skb);

FWIW, I'd find old_pcount or prev_pcount a little easier to read than
prev_packets_acked here (I see "oldpcount" is used in tcp_output
in a similar context).

>        TCP_SKB_CB(skb)->seq += len;
> +       TCP_SKB_CB(skb)->header.offset_ack = 0;

If the caller decides to trim a prefix of the skb that does not extend
to snd_una, then setting offset_ack to 0 here will cause us to forget
that some prefix is ACKed when we should have remembered this.
However, the API for the function invites the caller to chop off an
arbitrary amount (and there's no comment to dissuade the caller from
trying this). This seems to risk bugs in the future.

To attempt to make this API safer and simpler for future generations,
what do you think about calculating the len inside tcp_trim_head(),
something like:

static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
{
  u32 len = tcp_sk(sk)->snd_una - TCP_SKB_CB(skb)->seq;
...

...
 if (tcp_trim_head(sk, skb))
...

neal


* Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 18:40           ` Neal Cardwell
@ 2012-04-18 19:18             ` Eric Dumazet
  2012-04-18 19:51               ` [PATCH v2 " Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-18 19:18 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Tom Herbert, Maciej Żenczykowski,
	Yuchung Cheng

On Wed, 2012-04-18 at 14:40 -0400, Neal Cardwell wrote:

> 
> > +       prev_packets_acked = tcp_skb_pcount(skb);
> 
> FWIW, I'd find old_pcount or prev_pcount a little easier to read than
> prev_packets_acked here (I see "oldpcount" is used in tcp_output
> in a similar context).
> 

You're right, I'll change this.

> >        TCP_SKB_CB(skb)->seq += len;
> > +       TCP_SKB_CB(skb)->header.offset_ack = 0;
> 
> If the caller decides to trim a prefix of the skb that does not extend
> to snd_una, then setting offset_ack to 0 here will cause us to forget
> that some prefix is ACKed when we should have remembered this.
> However, the API for the function invites the caller to chop off an
> > arbitrary amount (and there's no comment to dissuade the caller from
> trying this). This seems to risk bugs in the future.
> 
> To attempt to make this API safer and simpler for future generations,
> what do you think about calculating the len inside tcp_trim_head(),
> something like:
> 
> static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
> {
>   u32 len = tcp_sk(sk)->snd_una - TCP_SKB_CB(skb)->seq;
> ...
> 
> ...
>  if (tcp_trim_head(sk, skb))
> ...
> 

Very good point, I'll use this suggestion too in v2.

Thanks!


* Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 15:49         ` [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls Eric Dumazet
       [not found]           ` <4F8EF317.10504@hp.com>
  2012-04-18 18:40           ` Neal Cardwell
@ 2012-04-18 19:41           ` Vijay Subramanian
  2012-04-18 19:49             ` Eric Dumazet
  2 siblings, 1 reply; 32+ messages in thread
From: Vijay Subramanian @ 2012-04-18 19:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tom Herbert, Neal Cardwell,
	Maciej Żenczykowski, Yuchung Cheng

>  {
>        if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
>                return -ENOMEM;
> @@ -1134,6 +1138,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
>        __pskb_trim_head(skb, len);
>
>        TCP_SKB_CB(skb)->seq += len;
> +       TCP_SKB_CB(skb)->header.offset_ack = 0;
>        skb->ip_summed = CHECKSUM_PARTIAL;
>
>        skb->truesize        -= len;
>

Eric,
tcp_trim_head() also updates skb->truesize as above. But is this the
right thing to do when only offsets/pointers are updated? If
__pskb_trim_head() removes len bytes but does not actually free memory,
should truesize be updated?  This could happen if data in the linear
part is acked.  This behavior takes place currently even without your
patch, by the way.


Thanks,
Vijay


* Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 19:41           ` [PATCH " Vijay Subramanian
@ 2012-04-18 19:49             ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2012-04-18 19:49 UTC (permalink / raw)
  To: Vijay Subramanian
  Cc: David Miller, netdev, Tom Herbert, Neal Cardwell,
	Maciej Żenczykowski, Yuchung Cheng

On Wed, 2012-04-18 at 12:41 -0700, Vijay Subramanian wrote:
> >  {
> >        if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
> >                return -ENOMEM;
> > @@ -1134,6 +1138,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
> >        __pskb_trim_head(skb, len);
> >
> >        TCP_SKB_CB(skb)->seq += len;
> > +       TCP_SKB_CB(skb)->header.offset_ack = 0;
> >        skb->ip_summed = CHECKSUM_PARTIAL;
> >
> >        skb->truesize        -= len;
> >
> 
> Eric,
> tcp_trim_head() also updates skb->truesize as above. But is this the
> right thing to do when only offsets/pointers are updated? If
> __pskb_trim_head() removes len bytes but does not actually free memory,
> should truesize be updated?  This could happen if data in the linear
> part is acked.  This behavior takes place currently even without your
> patch, by the way.
> 

That's ok.

__pskb_trim_head() callers are responsible for taking care of truesize
changes.

By the way, it's always complex to guess truesize changes. Removing some
bytes from the skb head doesn't free memory, so in fact we should not
change truesize in this case.

But only one skb in the TCP write queue might have a slightly wrong
truesize; it's not a big deal.

The problem with skb->truesize is mostly in the input path: when
skb->truesize is not correctly set by drivers in the first place, the
upper stack can accumulate errors in the socket receive queue.
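
To make that caller contract concrete, a hedged sketch of the pattern
(the write-queue accounting lines are an assumption modelled on the
kernel of that era, not a quote of the patch above):

	/* Sketch: __pskb_trim_head() itself leaves skb->truesize
	 * alone, so the caller adjusts truesize and the socket's
	 * write-queue accounting itself.
	 */
	__pskb_trim_head(skb, len);
	skb->truesize	   -= len;
	sk->sk_wmem_queued -= len;
	sk_mem_uncharge(sk, len);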


* [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 19:18             ` Eric Dumazet
@ 2012-04-18 19:51               ` Eric Dumazet
  2012-04-19 11:10                 ` Ilpo Järvinen
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-18 19:51 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Tom Herbert, Maciej Żenczykowski,
	Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
unexpected profiling results, with pskb_expand_head() at the top.

After further analysis, I found we hit page refcounts badly, because
when we transmit a full size skb (64 KB), we can receive an ACK for the
first segments of the frame while the skb has not been completely sent
by the NIC.

It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
close peer, we can receive a TCP ACK in less than 50 us rtt.

This is also true on 1Gb links, but there we were limited by wire
speed, not cpu.

When we try to trim the skb, tcp_trim_head() has to call
pskb_expand_head(), because the skb clone we did for transmit is still
alive in the TX ring buffer.

pskb_expand_head() is really expensive: it has to make about 16+16
atomic operations on page refcounts, not counting the skb head
reallocation/copy. It increases the chances of false sharing.

In fact, we don't really need to trim the skb. This costly operation
can be delayed to the point it is really needed: that is when a
retransmit must happen.

Most of the time, upcoming ACKs will ack the whole packet, and we can
free it with minimal cost (since the clone was already freed by TX
completion).

Of course, this means we don't uncharge the acked part from socket
limits until retransmit, but this is hardly a concern with current
autotuning (around 4 MB per socket). Even with a small cwnd limit, a
single packet cannot hold more than half the window.

Performance results on my Q6600 cpu and 82599EB 10-Gigabit card: about
3% less cpu used for the same workload (single netperf TCP_STREAM),
bounded by x4 PCI-e slots (4660 Mbits).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
v2: added Neal's suggestions

 include/net/tcp.h     |    6 ++++--
 net/ipv4/tcp_input.c  |   22 +++++++++++-----------
 net/ipv4/tcp_output.c |   25 +++++++++++++++++--------
 3 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5984e3..0f57706 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_retransmit_timer(struct sock *sk);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
-extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
+extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+				 unsigned int mss_now);
 extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
 
 extern void tcp_send_probe0(struct sock *);
@@ -640,7 +641,8 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
-	} header;	/* For incoming frames		*/
+		unsigned int offset_ack; /* part of acked data in this skb */
+	} header;
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	__u32		when;		/* used to compute rtt's	*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99448f0..bdec2e6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
 	}
 }
 
-/* If we get here, the whole TSO packet has not been acked. */
+/* If we get here, the whole packet has not been acked.
+ * We used to call tcp_trim_head() to remove acked data from skb,
+ * but its expensive with TSO if our previous clone is still in flight.
+ * We thus maintain an offset_ack, and hope no pskb_expand_head()
+ * is needed until whole packet is acked by upcoming ACKs.
+ */
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 packets_acked;
+	u32 oldpcount = tcp_skb_pcount(skb);
 
 	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
 
-	packets_acked = tcp_skb_pcount(skb);
-	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
-		return 0;
-	packets_acked -= tcp_skb_pcount(skb);
+	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;
 
-	if (packets_acked) {
-		BUG_ON(tcp_skb_pcount(skb) == 0);
-		BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-	}
+	if (oldpcount > 1)
+		tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
 
-	return packets_acked;
+	return oldpcount - tcp_skb_pcount(skb);
 }
 
 /* Remove acknowledged frames from the retransmission queue. If our packet
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de8790c..f66df37 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -927,11 +927,15 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
 	sk_mem_charge(sk, skb->truesize);
 }
 
-/* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-				 unsigned int mss_now)
+/* Initialize TSO segments for a packet.
+ * Part of skb (offset_ack) might have been acked.
+ */
+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+			  unsigned int mss_now)
 {
-	if (skb->len <= mss_now || !sk_can_gso(sk) ||
+	unsigned int len = skb->len - TCP_SKB_CB(skb)->header.offset_ack;
+
+	if (len <= mss_now || !sk_can_gso(sk) ||
 	    skb->ip_summed == CHECKSUM_NONE) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
@@ -940,7 +944,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
 		skb_shinfo(skb)->gso_size = 0;
 		skb_shinfo(skb)->gso_type = 0;
 	} else {
-		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len, mss_now);
 		skb_shinfo(skb)->gso_size = mss_now;
 		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
 	}
@@ -1125,15 +1129,20 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
 	skb->len = skb->data_len;
 }
 
-/* Remove acked data from a packet in the transmit queue. */
-int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+/* Remove acked data from a packet in the transmit queue.
+ * Ony called before retransmit.
+ */
+static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
 {
+	u32 len = tcp_sk(sk)->snd_una - TCP_SKB_CB(skb)->seq;
+
 	if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
 
 	__pskb_trim_head(skb, len);
 
 	TCP_SKB_CB(skb)->seq += len;
+	TCP_SKB_CB(skb)->header.offset_ack = 0;
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
 	skb->truesize	     -= len;
@@ -2096,7 +2105,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
-		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
+		if (tcp_trim_head(sk, skb))
 			return -ENOMEM;
 	}
 


* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-18 19:51               ` [PATCH v2 " Eric Dumazet
@ 2012-04-19 11:10                 ` Ilpo Järvinen
  2012-04-19 11:30                   ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Ilpo Järvinen @ 2012-04-19 11:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

On Wed, 18 Apr 2012, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
> unexpected profiling results, with pskb_expand_head() being in the top.
> 
> After further analysis, I found we hit page refcounts badly, because
> when we transmit a full-size skb (64 KB), we can receive an ACK for
> the first segments of the frame while the skb has not been completely
> sent by the NIC.
> 
> It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
> close peer, we can receive a TCP ACK in less than 50 us RTT.
> 
> This is also true on 1Gb links, but there we were limited by wire
> speed, not CPU.
> 
> When we try to trim the skb, tcp_trim_head() has to call
> pskb_expand_head(), because the skb clone we did for transmit is still
> alive in the TX ring buffer.
> 
> pskb_expand_head() is really expensive: it has to perform about 16+16
> atomic operations on page refcounts, not counting the skb head
> reallocation/copy, and it increases the chances of false sharing.
> 
> In fact, we don't really need to trim the skb. This costly operation
> can be delayed to the point it is really needed: that is, when a
> retransmit must happen.
> 
> Most of the time, upcoming ACKs will ack the whole packet, and we can
> free it with minimal cost (since the clone was already freed by TX
> completion).
> 
> Of course, this means we don't uncharge the acked part from socket
> limits until retransmit, but this is hardly a concern with current
> autotuning (around 4MB per socket).
> Even with a small cwnd limit, a single packet cannot hold more than
> half the window.
> 
> Performance results on my Q6600 CPU and 82599EB 10-Gigabit card:
> about 3% less CPU used for the same workload (single netperf
> TCP_STREAM), bounded by x4 PCIe slots (4660 Mbit/s).
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
> v2: added Neal's suggestions
> 
>  include/net/tcp.h     |    6 ++++--
>  net/ipv4/tcp_input.c  |   22 +++++++++++-----------
>  net/ipv4/tcp_output.c |   25 +++++++++++++++++--------
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d5984e3..0f57706 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
>  extern void tcp_retransmit_timer(struct sock *sk);
>  extern void tcp_xmit_retransmit_queue(struct sock *);
>  extern void tcp_simple_retransmit(struct sock *);
> -extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
> +extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
> +				 unsigned int mss_now);
>  extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
>  
>  extern void tcp_send_probe0(struct sock *);
> @@ -640,7 +641,8 @@ struct tcp_skb_cb {
>  #if IS_ENABLED(CONFIG_IPV6)
>  		struct inet6_skb_parm	h6;
>  #endif
> -	} header;	/* For incoming frames		*/
> +		unsigned int offset_ack; /* part of acked data in this skb */
> +	} header;
>  	__u32		seq;		/* Starting sequence number	*/
>  	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
>  	__u32		when;		/* used to compute rtt's	*/
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 99448f0..bdec2e6 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
>  	}
>  }
>  
> -/* If we get here, the whole TSO packet has not been acked. */
> +/* If we get here, the whole packet has not been acked.
> + * We used to call tcp_trim_head() to remove acked data from the skb,
> + * but it's expensive with TSO if our previous clone is still in flight.
> + * We thus maintain an offset_ack, and hope no pskb_expand_head()
> + * is needed until the whole packet is acked by upcoming ACKs.
> + */
>  static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	u32 packets_acked;
> +	u32 oldpcount = tcp_skb_pcount(skb);
>  
>  	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
>  
> -	packets_acked = tcp_skb_pcount(skb);
> -	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
> -		return 0;
> -	packets_acked -= tcp_skb_pcount(skb);
> +	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;

Now that you have a non-zero offset_ack, are the tcp_fragment()
callsites safe and working? ...I'm mostly worried about
tcp_mark_head_lost, which makes some assumptions about tp->snd_una and
TCP_SKB_CB(skb)->seq; however, other fragmenting also does not preserve
offset_ack properly (which might not be the end of the world though)?

-- 
 i.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 11:10                 ` Ilpo Järvinen
@ 2012-04-19 11:30                   ` Eric Dumazet
  2012-04-19 11:40                     ` Eric Dumazet
  2012-04-19 13:18                     ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 11:30 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:

> Now that you have a non-zero offset_ack, are the tcp_fragment()
> callsites safe and working? ...I'm mostly worried about
> tcp_mark_head_lost, which makes some assumptions about tp->snd_una and
> TCP_SKB_CB(skb)->seq; however, other fragmenting also does not
> preserve offset_ack properly (which might not be the end of the world
> though)?

Good point, I'll take a look.

I'll provide a v3 anyway with more performance data; I set up two cards
in PCIe x8 slots to get full bandwidth.

Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 11:30                   ` Eric Dumazet
@ 2012-04-19 11:40                     ` Eric Dumazet
  2012-04-19 11:57                       ` Ilpo Järvinen
  2012-04-19 13:18                     ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 11:40 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:
> 
> > Now that you have a non-zero offset_ack, are the tcp_fragment()
> > callsites safe and working? ...I'm mostly worried about
> > tcp_mark_head_lost, which makes some assumptions about tp->snd_una and
> > TCP_SKB_CB(skb)->seq; however, other fragmenting also does not
> > preserve offset_ack properly (which might not be the end of the world
> > though)?
> 
> Good point, I'll take a look.

Hmm, the only point where this could matter is if a packet is
retransmitted. For other packets, offset_ack = 0 (the default value on
skb allocation).

And tcp_retransmit_skb() first calls tcp_trim_head(sk, skb) if needed,
so tcp_fragment() is called with offset_ack == 0:

        if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
                if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
                        BUG();
                if (tcp_trim_head(sk, skb))
                        return -ENOMEM;
        }

...
if (skb->len > cur_mss) {
	if (tcp_fragment(sk, skb, cur_mss, cur_mss))



I could add a BUG_ON(offset_ack != 0) there to make sure this assertion
holds.

What do you think?
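
A minimal sketch of where such an assertion could sit, at the top of
tcp_fragment() (a sketch only, not part of the posted patch):

	/* tcp_fragment() should never see a partially-acked head skb:
	 * tcp_retransmit_skb() is expected to have trimmed it already.
	 */
	BUG_ON(TCP_SKB_CB(skb)->header.offset_ack != 0);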

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 11:40                     ` Eric Dumazet
@ 2012-04-19 11:57                       ` Ilpo Järvinen
  2012-04-19 12:44                         ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Ilpo Järvinen @ 2012-04-19 11:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1991 bytes --]

On Thu, 19 Apr 2012, Eric Dumazet wrote:

> On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> > On Thu, 2012-04-19 at 14:10 +0300, Ilpo Järvinen wrote:
> > 
> > > Now that you have a non-zero offset_ack, are the tcp_fragment()
> > > callsites safe and working? ...I'm mostly worried about
> > > tcp_mark_head_lost, which makes some assumptions about tp->snd_una and
> > > TCP_SKB_CB(skb)->seq; however, other fragmenting also does not
> > > preserve offset_ack properly (which might not be the end of the world
> > > though)?
> > 
> > Good point, I'll take a look.
> 
> Hmm, the only point where this could matter is if a packet is
> retransmitted.
>
> For other packets, offset_ack = 0 (the default value on skb
> allocation).
> 
> And tcp_retransmit_skb() first calls tcp_trim_head(sk, skb) if needed,
> so tcp_fragment() is called with offset_ack == 0:
> 
>         if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
>                 if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
>                         BUG();
>                 if (tcp_trim_head(sk, skb))
>                         return -ENOMEM;
>         }
> 
> ...
> if (skb->len > cur_mss) {
> 	if (tcp_fragment(sk, skb, cur_mss, cur_mss))
> 
> 
> 
> I could add a BUG_ON(offset_ack != 0) there to make sure this assertion
> holds.

If you end up putting in something like that, make sure you use WARN_ON
instead, as this surely isn't fatal enough to warrant a full stop of
the box :-).

> What do you think?

I'm not concerned about the output side; that seems to work because
the trim in tcp_retransmit_skb() gets rid of the extra data first.

The ACK input stuff is more interesting, e.g., this one in 
tcp_mark_head_lost:

	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);

It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack + 
(packets - oldcnt) * mss?

...There is a similar case in the sacktag code too, where it's aligning
to MSS boundaries in tcp_match_skb_to_sack().
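
A sketch of the adjustment being described for the tcp_mark_head_lost()
callsite (hypothetical; the surrounding logic is unchanged):

	/* skip the already-acked head bytes when picking the split point */
	err = tcp_fragment(sk, skb,
			   TCP_SKB_CB(skb)->header.offset_ack +
			   (packets - oldcnt) * mss, mss);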

-- 
 i.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 11:57                       ` Ilpo Järvinen
@ 2012-04-19 12:44                         ` Eric Dumazet
  2012-04-20 12:27                           ` Ilpo Järvinen
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 12:44 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 14:57 +0300, Ilpo Järvinen wrote:

> I'm not concerned about the output side; that seems to work because
> the trim in tcp_retransmit_skb() gets rid of the extra data first.
> 
> The ACK input stuff is more interesting, e.g., this one in 
> tcp_mark_head_lost:
> 
> 	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);
> 
> It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
> I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack + 
> (packets - oldcnt) * mss?
> 
> ...There is a similar case in the sacktag code too, where it's aligning
> to MSS boundaries in tcp_match_skb_to_sack().

Hmm yes, so maybe it's safer to update TCP_SKB_CB(skb)->seq in
tcp_tso_acked() (as well as offset_ack) and make the needed adjustments
in tcp_fragment() if we find offset_ack is non-zero.
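
One possible reading of that, as a rough sketch (not posted code):
tcp_tso_acked() would keep ->seq truthful while still deferring the
data trim,

	/* advance seq past the acked bytes; offset_ack remembers how
	 * much stale payload at the head of the skb must be skipped
	 */
	TCP_SKB_CB(skb)->header.offset_ack += tp->snd_una - TCP_SKB_CB(skb)->seq;
	TCP_SKB_CB(skb)->seq = tp->snd_una;

and tcp_fragment() would then add offset_ack to the byte offset it cuts
at whenever offset_ack is non-zero.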

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 11:30                   ` Eric Dumazet
  2012-04-19 11:40                     ` Eric Dumazet
@ 2012-04-19 13:18                     ` Eric Dumazet
  2012-04-19 13:52                       ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 13:18 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:

> I'll provide a v3 anyway with more performance data; I set up two cards
> in PCIe x8 slots to get full bandwidth.

Incidentally, using PCIe x8 slots no longer triggers the slow path on
an unpatched kernel with a single flow (~9410 Mbit/s).

It seems we are lucky enough to TX-complete sent clones before trying
tcp_trim_head() when processing ACKs.

Sounds like a timing issue, and the fact that the driver batches TX
completions and RX completions.

Also, BQL might have changed things a bit here (ixgbe is BQL-enabled).

Only if I start several concurrent flows do I see the
pskb_expand_head() overhead.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 13:18                     ` Eric Dumazet
@ 2012-04-19 13:52                       ` Eric Dumazet
  2012-04-19 14:10                         ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 13:52 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 15:18 +0200, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 13:30 +0200, Eric Dumazet wrote:
> 
> > I'll provide a v3 anyway with more performance data; I set up two cards
> > in PCIe x8 slots to get full bandwidth.
> 
> Incidentally, using PCIe x8 slots no longer triggers the slow path on
> an unpatched kernel with a single flow (~9410 Mbit/s).
> 
> It seems we are lucky enough to TX-complete sent clones before trying
> tcp_trim_head() when processing ACKs.
> 
> Sounds like a timing issue, and the fact that the driver batches TX
> completions and RX completions.
> 
> Also, BQL might have changed things a bit here (ixgbe is BQL-enabled).
> 
> Only if I start several concurrent flows do I see the
> pskb_expand_head() overhead.
> 
> 

And disabling GRO on the receiver definitely demonstrates the problem,
even with a single flow (performance drops from 9410 Mbit/s to
6050 Mbit/s).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 13:52                       ` Eric Dumazet
@ 2012-04-19 14:10                         ` Eric Dumazet
  2012-04-19 17:20                           ` Rick Jones
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 14:10 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 15:52 +0200, Eric Dumazet wrote:

> And disabling GRO on the receiver definitely demonstrates the problem,
> even with a single flow (performance drops from 9410 Mbit/s to
> 6050 Mbit/s).

That's insane.

Performance drops so much because we _drop_ incoming ACKs:

<     TCPSackShifted: 39117
<     TCPSackMerged: 16500
<     TCPSackShiftFallback: 5092
<     TCPBacklogDrop: 27965
---
>     TCPSackShifted: 35122
>     TCPSackMerged: 16368
>     TCPSackShiftFallback: 4889
>     TCPBacklogDrop: 23247

Hmm, maybe we should reduce skb->truesize for small packets before
queueing them in the socket backlog...
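
For reference, a simplified form of the check behind those
TCPBacklogDrop counts (abridged from sk_rcvqueues_full() /
sk_add_backlog(); details vary by kernel version):

	/* each ~66-byte ACK costs ~1280 bytes of truesize here, so a
	 * default 87380-byte sk_rcvbuf fills after roughly 68 ACKs
	 */
	if (sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
		return -ENOBUFS;	/* caller drops the skb */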

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 14:10                         ` Eric Dumazet
@ 2012-04-19 17:20                           ` Rick Jones
  2012-04-19 17:25                             ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Rick Jones @ 2012-04-19 17:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ilpo Järvinen, Neal Cardwell, David Miller, netdev,
	Tom Herbert, Maciej Żenczykowski, Yuchung Cheng

On 04/19/2012 07:10 AM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 15:52 +0200, Eric Dumazet wrote:
>
>> And disabling GRO on the receiver definitely demonstrates the problem,
>> even with a single flow (performance drops from 9410 Mbit/s to
>> 6050 Mbit/s).
>
> That's insane.
>
> Performance drops so much because we _drop_ incoming ACKs:
>
> <      TCPSackShifted: 39117
> <      TCPSackMerged: 16500
> <      TCPSackShiftFallback: 5092
> <      TCPBacklogDrop: 27965
> ---
>>      TCPSackShifted: 35122
>>      TCPSackMerged: 16368
>>      TCPSackShiftFallback: 4889
>>      TCPBacklogDrop: 23247
>
> Hmm, maybe we should reduce skb->truesize for small packets before
> queueing them in the socket backlog...

By copying them to smaller buffers? Or just by altering truesize? 
Wasn't the whole point of fixing all the broken truesize settings to 
accurately account for memory consumed?

rick

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 17:20                           ` Rick Jones
@ 2012-04-19 17:25                             ` Eric Dumazet
  2012-04-19 17:48                               ` Rick Jones
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 17:25 UTC (permalink / raw)
  To: Rick Jones
  Cc: Ilpo Järvinen, Neal Cardwell, David Miller, netdev,
	Tom Herbert, Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:

> By copying them to smaller buffers? Or just by altering truesize? 
> Wasn't the whole point of fixing all the broken truesize settings to 
> accurately account for memory consumed?

I checked, their truesize is OK (1024+256) for the ixgbe driver.
It could be a little smaller, but not by much (512 + 256).

No, it's only that sk_rcvbuf is small for a TCP sender,
and sk_add_backlog() makes sure we don't queue more than sk_rcvbuf
bytes in the backlog.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 17:25                             ` Eric Dumazet
@ 2012-04-19 17:48                               ` Rick Jones
  2012-04-19 18:00                                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Rick Jones @ 2012-04-19 17:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ilpo Järvinen, Neal Cardwell, David Miller, netdev,
	Tom Herbert, Maciej Żenczykowski, Yuchung Cheng

On 04/19/2012 10:25 AM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:
>
>> By copying them to smaller buffers? Or just by altering truesize?
>> Wasn't the whole point of fixing all the broken truesize settings to
>> accurately account for memory consumed?
>
> I checked, their truesize is OK (1024+256) for the ixgbe driver.
> It could be a little smaller, but not by much (512 + 256).
>
> No, it's only that sk_rcvbuf is small for a TCP sender,
> and sk_add_backlog() makes sure we don't queue more than sk_rcvbuf
> bytes in the backlog.

Sounds like a variation on the theme of wildly divergent
inbound/outbound bandwidth, with constrained ACKs constraining
throughput - only with buffer sizes.

87380 is the default SO_RCVBUF, right?  That should have allowed
87380/1280 or 68 ACKs to be queued.  Without ACK stretching from GRO
that should have covered 68 * 2896 or 196928 bytes.  Based on your
previous 54 usec to transmit 64 KB, that would be 162+ usecs to
accumulate those ACKs, so I guess a question becomes whether TCP can be
held off processing ACKs for > 162 usecs, and if so and that cannot be
changed, the autotuning will have to start increasing SO_RCVBUF
alongside SO_SNDBUF even if the endpoint is not receiving.  As a
handwave, since TCP does not know the buffer size(s) used by the
driver, by 1 MSS for every 2 MSS it adds to SO_SNDBUF.  Or find some
way to do it "on demand" in the about-to-drop path.

That, or bare ACKs have to be excluded from the overhead checks
somehow, I guess, or there has to be more aggressive copying into
smaller buffers.

Thankfully, when applications make explicit setsockopt() calls, they 
tend (ok, perhaps that is a bit of a guess) to set both SO_SNDBUF and 
SO_RCVBUF at the same time.

rick

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 17:48                               ` Rick Jones
@ 2012-04-19 18:00                                 ` Eric Dumazet
  2012-04-19 18:05                                   ` Rick Jones
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2012-04-19 18:00 UTC (permalink / raw)
  To: Rick Jones
  Cc: Ilpo Järvinen, Neal Cardwell, David Miller, netdev,
	Tom Herbert, Maciej Żenczykowski, Yuchung Cheng

On Thu, 2012-04-19 at 10:48 -0700, Rick Jones wrote:
> On 04/19/2012 10:25 AM, Eric Dumazet wrote:
> > On Thu, 2012-04-19 at 10:20 -0700, Rick Jones wrote:
> >
> >> By copying them to smaller buffers? Or just by altering truesize?
> >> Wasn't the whole point of fixing all the broken truesize settings to
> >> accurately account for memory consumed?
> >
> > I checked, their truesize is OK (1024+256) for the ixgbe driver.
> > It could be a little smaller, but not by much (512 + 256).
> >
> > No, it's only that sk_rcvbuf is small for a TCP sender,
> > and sk_add_backlog() makes sure we don't queue more than sk_rcvbuf
> > bytes in the backlog.
> 
> Sounds like a variation on the theme of wildly divergent 
> inbound/outbound bandwidth and constraining ACKs constraining throughput 
> - only with buffer sizes.
> 
> 87380 is the default SO_RCVBUF, right?  That should have allowed
> 87380/1280 or 68 ACKs to be queued.  Without ACK stretching from GRO
> that should have covered 68 * 2896 or 196928 bytes.  Based on your
> previous 54 usec to transmit 64 KB, that would be 162+ usecs to
> accumulate those ACKs, so I guess a question becomes whether TCP can be
> held off processing ACKs for > 162 usecs, and if so and that cannot be
> changed, the autotuning will have to start increasing SO_RCVBUF
> alongside SO_SNDBUF even if the endpoint is not receiving.  As a
> handwave, since TCP does not know the buffer size(s) used by the
> driver, by 1 MSS for every 2 MSS it adds to SO_SNDBUF.  Or find some
> way to do it "on demand" in the about-to-drop path.
> 
> That, or bare ACKs have to be excluded from the overhead checks
> somehow, I guess, or there has to be more aggressive copying into
> smaller buffers.
> 

Nope, we need a limit or risk OOM if a malicious peer sends an ACK flood ;)

To be clear, if I change tcp_rmem[1] from 87380 to a big value, I no
longer have ACK drops, but performance is still poor; I am still
investigating.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 18:00                                 ` Eric Dumazet
@ 2012-04-19 18:05                                   ` Rick Jones
  0 siblings, 0 replies; 32+ messages in thread
From: Rick Jones @ 2012-04-19 18:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ilpo Järvinen, Neal Cardwell, David Miller, netdev,
	Tom Herbert, Maciej Żenczykowski, Yuchung Cheng


>> That, or bare ACKs have to be excluded from the overhead checks
>> somehow, I guess, or there has to be more aggressive copying into
>> smaller buffers.
>>
>
> Nope, we need a limit or risk OOM if a malicious peer sends an ACK flood ;)

Well, there is that...

>
> To be clear, if I change tcp_rmem[1] from 87380 to a big value, I no
> longer have ACK drops, but performance is still poor; I am still
> investigating.

What happens if you set net.core.[rw]mem_max to 4 MB and then use 
something like -s 1M -S 1M in netperf?

netperf -t TCP_STREAM -H <remote> -- -s 1M -S 1M -m 64K

(or -m 16K if you want to keep the send size the same as your previous 
tests...)

rick

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
  2012-04-19 12:44                         ` Eric Dumazet
@ 2012-04-20 12:27                           ` Ilpo Järvinen
  0 siblings, 0 replies; 32+ messages in thread
From: Ilpo Järvinen @ 2012-04-20 12:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, David Miller, netdev, Tom Herbert,
	Maciej Żenczykowski, Yuchung Cheng

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2004 bytes --]

On Thu, 19 Apr 2012, Eric Dumazet wrote:

> On Thu, 2012-04-19 at 14:57 +0300, Ilpo Järvinen wrote:
> 
> > I'm not concerned about the output side; that seems to work because
> > the trim in tcp_retransmit_skb() gets rid of the extra data first.
> > 
> > The ACK input stuff is more interesting, e.g., this one in 
> > tcp_mark_head_lost:
> > 
> > 	err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);
> > 
> > It splits from TCP_SKB_CB(skb)->seq + (packets - oldcnt) * mss whereas
> > I think the desired point would be: TCP_SKB_CB(skb)->seq + offset_ack + 
> > (packets - oldcnt) * mss?
> > 
> > ...There is a similar case in the sacktag code too, where it's aligning
> > to MSS boundaries in tcp_match_skb_to_sack().
> 
> Hmm yes, so maybe it's safer to update TCP_SKB_CB(skb)->seq in
> tcp_tso_acked() (as well as offset_ack) and make the needed adjustments
> in tcp_fragment() if we find offset_ack is non-zero.

I suppose that somewhat works; it helps a lot here that you work only
with the head skb, making a lot of cases impossible... I once made
something similar (before I came up with the current shift/merge
approach) and ended up with this:
  http://www.mail-archive.com/netdev@vger.kernel.org/msg56191.html

...But...

There's still another can of worms, it seems... At least tcp_skb_seglen,
which returns weird results when pcount becomes 1!

...Somewhat related to the pcount == 1 problem, I've long wondered if
the zeroed gso_size with pcount == 1 is worth keeping in the first
place?

However, I kind of liked the neatness of the original approach, where
->seq does not lie. That would have kept most of the stuff very
localized, because each skb is still fully valid and consistent with
itself, whereas introducing lies adds lots of hidden traps (except for
the pcount, of course, but consistency for it has not been there for
some years already :-)). The tcp_match_skb_to_sack code seems to
actually work exactly because of this consistency (if I got it right on
the second/third read).
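
For reference, the helper in question is tiny (as it stands in this era
of the tree):

	static int tcp_skb_seglen(const struct sk_buff *skb)
	{
		return tcp_skb_pcount(skb) == 1 ? skb->len : tcp_skb_mss(skb);
	}

With a non-zero offset_ack and pcount collapsed to 1, skb->len still
includes the acked head, so the reported segment length is inflated by
exactly offset_ack bytes.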


-- 
 i.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-04-20 12:27 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17  9:06 [BUG] ixgbe: something wrong with queue selection ? Eric Dumazet
2012-04-17  9:16 ` Jeff Kirsher
2012-04-17 16:01   ` Alexander Duyck
2012-04-17 16:38     ` John Fastabend
2012-04-17 17:07       ` Ben Hutchings
2012-04-17 16:46     ` Eric Dumazet
2012-04-17 21:38       ` TSO not 10G friendly if peer is close enough Eric Dumazet
2012-04-17 21:47         ` David Miller
2012-04-18  3:00           ` Eric Dumazet
2012-04-18 15:49         ` [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls Eric Dumazet
     [not found]           ` <4F8EF317.10504@hp.com>
2012-04-18 17:16             ` Eric Dumazet
2012-04-18 17:30               ` Rick Jones
2012-04-18 17:40                 ` Eric Dumazet
2012-04-18 18:40           ` Neal Cardwell
2012-04-18 19:18             ` Eric Dumazet
2012-04-18 19:51               ` [PATCH v2 " Eric Dumazet
2012-04-19 11:10                 ` Ilpo Järvinen
2012-04-19 11:30                   ` Eric Dumazet
2012-04-19 11:40                     ` Eric Dumazet
2012-04-19 11:57                       ` Ilpo Järvinen
2012-04-19 12:44                         ` Eric Dumazet
2012-04-20 12:27                           ` Ilpo Järvinen
2012-04-19 13:18                     ` Eric Dumazet
2012-04-19 13:52                       ` Eric Dumazet
2012-04-19 14:10                         ` Eric Dumazet
2012-04-19 17:20                           ` Rick Jones
2012-04-19 17:25                             ` Eric Dumazet
2012-04-19 17:48                               ` Rick Jones
2012-04-19 18:00                                 ` Eric Dumazet
2012-04-19 18:05                                   ` Rick Jones
2012-04-18 19:41           ` [PATCH " Vijay Subramanian
2012-04-18 19:49             ` Eric Dumazet
