* [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
@ 2012-04-23  9:38 Eric Dumazet
  2012-04-23 17:14 ` Rick Jones
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-23  9:38 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Tom Herbert, Neal Cardwell, Maciej Żenczykowski,
	Yuchung Cheng, Ilpo Järvinen, Rick Jones

From: Eric Dumazet <edumazet@google.com>

While investigating TCP performance problems on 10Gb+ links, we found a
TCP sender was dropping a lot of incoming ACKs because of the sk_rcvbuf
limit in sk_add_backlog(), especially if the receiver doesn't use
GRO/LRO and sends one ACK every two MSS segments.

A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
value (87380), allowing only a too-small backlog.

A TCP ACK, even though small, can consume nearly the same truesize
space as an outgoing packet. Using sk_rcvbuf + sk_sndbuf as the limit
makes sense and is fast to compute.
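
For context, the admission test sk_add_backlog() performs is roughly the
following (a simplified sketch of the helper it relies on, not the exact
kernel code):

static inline bool sk_rcvqueues_full(const struct sock *sk,
				     const struct sk_buff *skb,
				     unsigned int limit)
{
	unsigned int qsize = sk->sk_backlog.len +
			     atomic_read(&sk->sk_rmem_alloc);

	/* the skb is refused once queued truesize exceeds the limit */
	return qsize > limit;
}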

Performance results with netperf, single flow, receiver with GRO/LRO
disabled: 7500 Mbit/s instead of 6050 Mbit/s, and no more TCPBacklogDrop
increments at the sender.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
---
 net/ipv4/tcp_ipv4.c |    3 ++-
 net/ipv6/tcp_ipv6.c |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 917607e..cf97e98 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1752,7 +1752,8 @@ process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v4_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
+	} else if (unlikely(sk_add_backlog(sk, skb,
+					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b04e6d8..5fb19d3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1654,7 +1654,8 @@ process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v6_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
+	} else if (unlikely(sk_add_backlog(sk, skb,
+					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23  9:38 [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP Eric Dumazet
@ 2012-04-23 17:14 ` Rick Jones
  2012-04-23 17:23   ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Rick Jones @ 2012-04-23 17:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tom Herbert, Neal Cardwell,
	Maciej Żenczykowski, Yuchung Cheng, Ilpo Järvinen

On 04/23/2012 02:38 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating TCP performance problems on 10Gb+ links, we found a
> TCP sender was dropping a lot of incoming ACKs because of the sk_rcvbuf
> limit in sk_add_backlog(), especially if the receiver doesn't use
> GRO/LRO and sends one ACK every two MSS segments.
>
> A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
> value (87380), allowing only a too-small backlog.
>
> A TCP ACK, even though small, can consume nearly the same truesize
> space as an outgoing packet. Using sk_rcvbuf + sk_sndbuf as the limit
> makes sense and is fast to compute.
>
> Performance results with netperf, single flow, receiver with GRO/LRO
> disabled: 7500 Mbit/s instead of 6050 Mbit/s, and no more TCPBacklogDrop
> increments at the sender.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Cc: Rick Jones <rick.jones2@hp.com>
> ---
>   net/ipv4/tcp_ipv4.c |    3 ++-
>   net/ipv6/tcp_ipv6.c |    3 ++-
>   2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 917607e..cf97e98 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1752,7 +1752,8 @@ process:
>   			if (!tcp_prequeue(sk, skb))
>   				ret = tcp_v4_do_rcv(sk, skb);
>   		}
> -	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
> +	} else if (unlikely(sk_add_backlog(sk, skb,
> +					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
>   		bh_unlock_sock(sk);
>   		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
>   		goto discard_and_relse;

This will increase what can be queued for arriving segments in general, 
and not for ACKs specifically, yes?  (A possible issue that would have 
come up with my previous wondering about just increasing SO_RCVBUF as 
SO_SNDBUF was increasing.)  Perhaps only add sk->sk_sndbuf to the limit 
if the arriving segment contains no data?
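
In code, that suggestion might look roughly like this (a hypothetical
sketch, not what the patch does; a dataless segment is one with
seq == end_seq):

	u32 limit = sk->sk_rcvbuf;

	/* hypothetical: widen the limit only for dataless (pure ACK)
	 * segments; segments carrying data keep the plain rcvbuf limit
	 */
	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
		limit += sk->sk_sndbuf;

	if (unlikely(sk_add_backlog(sk, skb, limit))) {
		bh_unlock_sock(sk);
		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
		goto discard_and_relse;
	}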

rick


> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index b04e6d8..5fb19d3 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1654,7 +1654,8 @@ process:
>   			if (!tcp_prequeue(sk, skb))
>   				ret = tcp_v6_do_rcv(sk, skb);
>   		}
> -	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
> +	} else if (unlikely(sk_add_backlog(sk, skb,
> +					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
>   		bh_unlock_sock(sk);
>   		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
>   		goto discard_and_relse;
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 17:14 ` Rick Jones
@ 2012-04-23 17:23   ` Eric Dumazet
  2012-04-23 20:01     ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-23 17:23 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Tom Herbert, Neal Cardwell,
	Maciej Żenczykowski, Yuchung Cheng, Ilpo Järvinen

On Mon, 2012-04-23 at 10:14 -0700, Rick Jones wrote:

> 
> This will increase what can be queued for arriving segments in general, 
> and not for ACKs specifically, yes?  (A possible issue that would have 
> come up with my previous wondering about just increasing SO_RCVBUF as 
> SO_SNDBUF was increasing.)  Perhaps only add sk->sk_sndbuf to the limit 
> if the arriving segment contains no data?

That's the backlog limit that we tweak here.

It's not a big deal if we allow a few more packets to come in and later
drop them when we hit the real rcvbuf limit. (ACKs won't consume space,
since they are freed as soon as they are processed.)

By the way, we used to have a (sk_rcvbuf << 1) limit in the past.

Before commit c377411f2494a931ff we had:

if (sk->sk_backlog.len >= max(sk->sk_backlog.limit, sk->sk_rcvbuf << 1))
	return -ENOBUFS;

We probably had drops in the past but didn't notice, since we lacked
a counter for those drops.
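
(With the default 87380-byte rcvbuf, that older test allowed the backlog
to grow to 174760 bytes, roughly twice what the current sk_rcvbuf-only
limit permits.)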

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 17:23   ` Eric Dumazet
@ 2012-04-23 20:01     ` David Miller
  2012-04-23 20:37       ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-23 20:01 UTC (permalink / raw)
  To: eric.dumazet
  Cc: rick.jones2, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 Apr 2012 19:23:15 +0200

> On Mon, 2012-04-23 at 10:14 -0700, Rick Jones wrote:
> 
>> 
>> This will increase what can be queued for arriving segments in general, 
>> and not for ACKs specifically, yes?  (A possible issue that would have 
>> come up with my previous wondering about just increasing SO_RCVBUF as 
>> SO_SNDBUF was increasing.)  Perhaps only add sk->sk_sndbuf to the limit 
>> if the arriving segment contains no data?
> 
> That's the backlog limit that we tweak here.
> 
> It's not a big deal if we allow a few more packets to come in and later
> drop them when we hit the real rcvbuf limit. (ACKs won't consume space,
> since they are freed as soon as they are processed.)

Hmmm... why don't we just acknowledge reality and special case ACKs?

If a TCP packet is dataless we should just let it go through no matter
what and with no limits.  It is by definition transient and will not
get queued up into the socket past this backlog stage.

This proposed patch allows non-dataless packets to eat more space in
the backlog, thus the concern and slight pushback.  And from another
perspective, having the stack process data packets which will just
get dropped when we try to attach it to the receive queue is just
wasted work.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 20:01     ` David Miller
@ 2012-04-23 20:37       ` Eric Dumazet
  2012-04-23 20:57         ` Rick Jones
                           ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-04-23 20:37 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:

> Hmmm... why don't we just acknowledge reality and special case ACKs?
> 

Yes, why not.


> If a TCP packet is dataless we should just let it go through no matter
> what and with no limits.  It is by definition transient and will not
> get queued up into the socket past this backlog stage.
> 

Even though they are transient, we need a limit. Without copybreak, an
ACK can cost 2048+256 bytes.

In my 10Gbit tests (standard netperf using 16K buffers), I've seen
backlogs of 300 ACK packets...
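
(At 2048+256 bytes each, 300 backlogged ACKs pin about 300 * 2304 =
691200 bytes, roughly 675 KB, of truesize.)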

> This proposed patch allows non-dataless packets to eat more space in
> the backlog, thus the concern and slight pushback.  And from another
> perspective, having the stack process data packets which will just
> get dropped when we try to attach it to the receive queue is just
> wasted work.

We could try to coalesce ACKs before backlogging them. I'll work on
this.

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 20:37       ` Eric Dumazet
@ 2012-04-23 20:57         ` Rick Jones
  2012-04-23 21:30           ` Eric Dumazet
  2012-04-23 21:01         ` David Miller
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Rick Jones @ 2012-04-23 20:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On 04/23/2012 01:37 PM, Eric Dumazet wrote:
> In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> backlogs of 300 ACK packets...

Probably better to call that something other than 16K buffers - the send 
size was probably 16K, which reflected SO_SNDBUF at the time the data 
socket was created, but clearly SO_SNDBUF grew in that timeframe.

And those values are "standard" for netperf only in the context of 
(default) Linux - on other platforms the defaults in the stack, and thus 
in netperf, are probably different.

The classic/migrated classic tests report only the initial socket buffer 
sizes, not what they become by the end of the test:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.1.3 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    10.00     941.06

To see what they are at the end of the test requires more direct use of 
the omni path.  Either by way of test type:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t omni
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () 
port 0 AF_INET
Local       Remote      Local  Elapsed Throughput Throughput
Send Socket Recv Socket Send   Time               Units
Size        Size        Size   (sec)
Final       Final
266640      87380       16384  10.00   940.92     10^6bits/s

or omni output selection:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -- -k 
lss_size_req,lss_size,lss_size_end,rsr_size_req,rsr_size,rsr_size_end
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.1.3 () port 0 AF_INET
LSS_SIZE_REQ=-1
LSS_SIZE=16384
LSS_SIZE_END=266640
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=87380

BTW, does it make sense that the SO_SNDBUF size on the netperf side 
(lss_size_end - 2.6.38-14-generic kernel) grew larger than the SO_RCVBUF 
on the netserver side? (3.2.0-rc4+)

rick jones

PS - here is data flowing the other way:
raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t TCP_MAERTS 
-- -k lsr_size_req,lsr_size,lsr_size_end,rss_size_req,rss_size,rss_size_end
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.1.3 () port 0 AF_INET
LSR_SIZE_REQ=-1
LSR_SIZE=87380
LSR_SIZE_END=4194304
RSS_SIZE_REQ=-1
RSS_SIZE=16384
RSS_SIZE_END=65536

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 20:37       ` Eric Dumazet
  2012-04-23 20:57         ` Rick Jones
@ 2012-04-23 21:01         ` David Miller
  2012-04-23 21:38           ` Eric Dumazet
  2012-04-24  2:20         ` Eric Dumazet
  2012-04-24  8:44         ` David Laight
  3 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-23 21:01 UTC (permalink / raw)
  To: eric.dumazet
  Cc: rick.jones2, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 Apr 2012 22:37:26 +0200

> We could try to coalesce ACKs before backlogging them. I'll work on
> this.

Great idea, although I wonder about the effect this could have on RTT
measurements.  Instead of having N RTT measurements, we'd have just
one.

Granted, what happens right now wrt. RTT measurements with such huge
ACK backlogs isn't all that nice either.

Ideally, perhaps, we'd do a timestamp diff at the time we insert the
packet into the backlog.  That way we wouldn't gain the RTT inaccuracy
introduced by such queueing delays and ACK backlogs.

Another way to look at it is that the coalesced scheme would actually
improve RTT measurements, since the most accurate (and least
"delayed") of the timestamps would be the only one processed :-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 20:57         ` Rick Jones
@ 2012-04-23 21:30           ` Eric Dumazet
  2012-04-23 21:51             ` Rick Jones
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-23 21:30 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote:
> On 04/23/2012 01:37 PM, Eric Dumazet wrote:
> > In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> > backlogs of 300 ACK packets...
> 
> Probably better to call that something other than 16K buffers - the send 
> size was probably 16K, which reflected SO_SNDBUF at the time the data 
> socket was created, but clearly SO_SNDBUF grew in that timeframe.
> 


Maybe I was not clear: the application does sendmsg() with 16KB buffers.

Yet, in the small time it takes to perform this operation, softirq can
queue up to 300 packets coming from the other side.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 21:01         ` David Miller
@ 2012-04-23 21:38           ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-04-23 21:38 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On Mon, 2012-04-23 at 17:01 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 23 Apr 2012 22:37:26 +0200
> 
> > We could try to coalesce ACKs before backlogging them. I'll work on
> > this.
> 
> Great idea, although I wonder about the effect this could have on RTT
> measurements.  Instead of having N RTT measurements, we'd have just
> one.
> 
> Granted, what happens right now wrt. RTT measurements with such huge
> ACK backlogs isn't all that nice either.
> 
> Ideally, perhaps, we'd do a timestamp diff at the time we insert the
> packet into the backlog.  That way we wouldn't gain the RTT inaccuracy
> introduced by such queueing delays and ACK backlogs.
> 
> Another way to look at it is that the coalesced scheme would actually
> improve RTT measurements, since the most accurate (and least
> "delayed") of the timestamps would be the only one processed :-)

The big part of the work is not doing the coalescing itself, but also
counting the number of ACKs that are going to be carried into the TCP
stack, if we want cwnd to be updated correctly.

Basically I'll have to add a new skb field (in cb[]) to properly count
the number of ACKs 'included' in a single packet.

About the RTT: some congestion modules need TCP_CONG_RTT_STAMP, but the
time is taken when backlog processing is done, that is, after the packet
has gone into and out of the backlog.
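
Hypothetically, something along these lines (field name illustrative,
not from any posted patch):

	/* in struct tcp_skb_cb, which lives in skb->cb[] */
	__u16	ack_cnt;	/* pure ACKs represented by this skb */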

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 21:30           ` Eric Dumazet
@ 2012-04-23 21:51             ` Rick Jones
  2012-04-23 21:56               ` Rick Jones
  2012-04-23 22:05               ` Eric Dumazet
  0 siblings, 2 replies; 37+ messages in thread
From: Rick Jones @ 2012-04-23 21:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On 04/23/2012 02:30 PM, Eric Dumazet wrote:
> On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote:
>> Probably better to call that something other than 16K buffers - the send
>> size was probably 16K, which reflected SO_SNDBUF at the time the data
>> socket was created, but clearly SO_SNDBUF grew in that timeframe.
>>
>
>
> Maybe I was not clear: the application does sendmsg() with 16KB buffers.

I'd probably call that a 16K send test.  The root of the issue is that 
there are "send buffers" and "send socket buffers" (and their receive 
versions).

My "canonical" test - at least the one that appears in most of my 
contemporary scripts - uses a 64K send size for the bulk transfer tests. 
I switch back and forth between tests which allow the socket buffer size 
to be determined automagically, and those where I set both sides' socket 
buffers to 1M via the test-specific -s and -S options.  In "netperf 
speak" those would probably be "x64K" and "1Mx64k" respectively.  More 
generally "<socket buffer size>x<send size>" (I rarely set/specify the 
receive size in those tests, leaving it at whatever SO_RCVBUF is at the 
start.)

> Yet, in the small time it takes to perform this operation, softirq can
> queue up to 300 packets coming from the other side.

There is more to it than just queuing up 16 KB, right?

rick

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 21:51             ` Rick Jones
@ 2012-04-23 21:56               ` Rick Jones
  2012-04-23 22:05               ` Eric Dumazet
  1 sibling, 0 replies; 37+ messages in thread
From: Rick Jones @ 2012-04-23 21:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On 04/23/2012 02:51 PM, Rick Jones wrote:
> On 04/23/2012 02:30 PM, Eric Dumazet wrote:
>> On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote:
>>> Probably better to call that something other than 16K buffers - the send
>>> size was probably 16K, which reflected SO_SNDBUF at the time the data
>>> socket was created, but clearly SO_SNDBUF grew in that timeframe.
>>>
>>
>>
>> Maybe I was not clear: the application does sendmsg() with 16KB buffers.
>
> I'd probably call that a 16K send test.  The root of the issue is that
> there are "send buffers" and "send socket buffers" (and their receive
> versions).
>
> My "canonical" test - at least the one that appears in most of my
> contemporary scripts - uses a 64K send size for the bulk transfer tests.
> I switch back and forth between tests which allow the socket buffer size
> to be determined automagically, and those where I set both sides' socket
> buffers to 1M via the test-specific -s and -S options.  In "netperf
> speak" those would probably be "x64K" and "1Mx64k" respectively.  More
> generally "<socket buffer size>x<send size>" (I rarely set/specify the
> receive size in those tests, leaving it at whatever SO_RCVBUF is at the
> start.)
>
>> Yet, in the small time it takes to perform this operation, softirq can
>> queue up to 300 packets coming from the other side.
>
> There is more to it than just queuing up 16 KB, right?

I should have added that 300 ACKs seems huge as a backlog.  At 
ack-every-other that is 300 * 1448 * 2, or 868800 bytes' worth of ACKs. 
That sounds like a great deal more than just one 16KB send's worth of 
being held off.  I mean, at 10GbE speeds (using your 54 usec for 64KB) 
that represents data which took something like three quarters of a 
millisecond to transmit on the wire.
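
(Check: 868800 bytes * 8 bits / 10 Gbit/s is about 695 usec, which is
where the three-quarters-of-a-millisecond figure comes from.)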

rick

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 21:51             ` Rick Jones
  2012-04-23 21:56               ` Rick Jones
@ 2012-04-23 22:05               ` Eric Dumazet
  2012-04-23 22:16                 ` Rick Jones
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-23 22:05 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On Mon, 2012-04-23 at 14:51 -0700, Rick Jones wrote:
> On 04/23/2012 02:30 PM, Eric Dumazet wrote:

> > Yet, in the small time it takes to perform this operation, softirq can
> > queue up to 300 packets coming from the other side.
> 
> There is more to it than just queuing up 16 KB, right?

At full rate, we send 825,000 packets per second, and should receive
412,000 ACKs per second if the receiver is standard TCP.

The ACKs are not smooth, because the receiver also has a huge backlog
issue and can send trains of ACKs. (I have seen backlogs on the receiver
take more than 500 us to be processed.)

If the copyin(16KB) from user to kernel takes some microseconds
(preempt, irqs...), it's pretty easy to catch an ACK train in this
window.
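
(Sanity check on those rates, assuming 1448-byte MSS segments in
1514-byte wire frames: 10 Gbit/s / 12112 bits per frame gives about
825,000 packets per second, and one ACK per two segments then gives
about 412,000 ACKs per second.)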

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 22:05               ` Eric Dumazet
@ 2012-04-23 22:16                 ` Rick Jones
  2012-04-24 15:25                   ` Christoph Lameter
  0 siblings, 1 reply; 37+ messages in thread
From: Rick Jones @ 2012-04-23 22:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On 04/23/2012 03:05 PM, Eric Dumazet wrote:
> On Mon, 2012-04-23 at 14:51 -0700, Rick Jones wrote:
>> On 04/23/2012 02:30 PM, Eric Dumazet wrote:
>
>>> Yet, in the small time it takes to perform this operation, softirq can
>>> queue up to 300 packets coming from the other side.
>>
>> There is more to it than just queuing up 16 KB, right?
>
> At full rate, we send 825,000 packets per second, and should receive
> 412,000 ACKs per second if the receiver is standard TCP.
>
> The ACKs are not smooth, because the receiver also has a huge backlog
> issue and can send trains of ACKs. (I have seen backlogs on the receiver
> take more than 500 us to be processed.)
>
> If the copyin(16KB) from user to kernel takes some microseconds
> (preempt, irqs...), it's pretty easy to catch an ACK train in this
> window.

Is it at all possible to have the copies happen without the connection 
being locked?  If indeed it is possible to be held off with the 
connection locked for the better part of 3/4 of a millisecond, just what 
will that do to 40 or 100 GbE?  If you've been seeing queues of 300 ACKs 
at 10 GbE that would be 3000 at 100 GbE, and assuming those are all in a 
2048-byte buffer that's 6MB just of ACKs.  I suppose 100GbE does mean 
non-trivial quantities of buffering anyway, but that does still seem 
rather high.

rick
Thank goodness for GRO's ACK stretching as an ACK-avoidance heuristic, I 
guess...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 20:37       ` Eric Dumazet
  2012-04-23 20:57         ` Rick Jones
  2012-04-23 21:01         ` David Miller
@ 2012-04-24  2:20         ` Eric Dumazet
  2012-04-24  2:27           ` David Miller
  2012-04-24  8:01           ` Ilpo Järvinen
  2012-04-24  8:44         ` David Laight
  3 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-04-24  2:20 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

On Mon, 2012-04-23 at 22:37 +0200, Eric Dumazet wrote:

> We could try to coalesce ACKs before backlogging them. I'll work on
> this.
> 

I did an experiment, and found that basic coalescing was not working in
the case of packet loss and a SACK storm.

Doing smart coalescing in this case sounds really complex.

Should we really continue this way?


 include/net/tcp.h   |    1 +
 net/ipv4/tcp_ipv4.c |   32 +++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fc880e9..de8d847 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1418,6 +1418,7 @@ static inline unsigned int tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+extern bool tcp_ack_coalesce(struct sock *sk, struct sk_buff *skb);
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0883921..b5a3bac 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1670,6 +1670,36 @@ csum_err:
 }
 EXPORT_SYMBOL(tcp_v4_do_rcv);
 
+/* socket is owned by user.
+ * Before queuing this skb into backlog, try to coalesce it to previous skb.
+ * We only take care of pure ACKs.
+ */
+bool tcp_ack_coalesce(struct sock *sk, struct sk_buff *skb)
+{
+	struct sk_buff *prev = sk->sk_backlog.tail;
+	const struct tcphdr *th, *thp;
+	unsigned int i, thlen;
+
+	if (TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq ||
+	    !prev ||
+	    TCP_SKB_CB(skb)->seq != TCP_SKB_CB(prev)->end_seq)
+		return false;
+	th = tcp_hdr(skb);
+	thp = tcp_hdr(prev);
+	thlen = th->doff * 4;
+	i = sizeof(th->source) + sizeof(th->dest) +
+	    sizeof(th->seq) + sizeof(th->ack_seq);
+	for (; i < thlen; i += 4) {
+		if (*(u32 *)((u8 *)th + i) != *(u32 *)((u8 *)thp + i))
+			return false;
+	}
+	if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(prev)->ack_seq))
+		TCP_SKB_CB(prev)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+	consume_skb(skb);
+	return true;
+}
+EXPORT_SYMBOL(tcp_ack_coalesce);
+
 /*
  *	From tcp_input.c
  */
@@ -1752,7 +1782,7 @@ process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v4_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb))) {
+	} else if (!tcp_ack_coalesce(sk, skb) && sk_add_backlog(sk, skb)) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  2:20         ` Eric Dumazet
@ 2012-04-24  2:27           ` David Miller
  2012-04-24  8:01           ` Ilpo Järvinen
  1 sibling, 0 replies; 37+ messages in thread
From: David Miller @ 2012-04-24  2:27 UTC (permalink / raw)
  To: eric.dumazet
  Cc: rick.jones2, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 24 Apr 2012 04:20:12 +0200

> On Mon, 2012-04-23 at 22:37 +0200, Eric Dumazet wrote:
> 
>> We could try to coalesce ACKs before backlogging them. I'll work on
>> this.
>> 
> 
> I did an experiment, and found that basic coalescing was not working in
> the case of packet loss and a SACK storm.
> 
> Doing smart coalescing in this case sounds really complex.
> 
> Should we really continue this way?

We can consider it later, but for now let's defer.

I'll apply your sk_add_backlog patches for the time being.

Thanks Eric.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  2:20         ` Eric Dumazet
  2012-04-24  2:27           ` David Miller
@ 2012-04-24  8:01           ` Ilpo Järvinen
  2012-04-24  8:10             ` David Miller
  2012-04-24  8:49             ` [PATCH 2/2 net-next] tcp: sk_add_backlog() is too agressive for TCP Eric Dumazet
  1 sibling, 2 replies; 37+ messages in thread
From: Ilpo Järvinen @ 2012-04-24  8:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rick.jones2, Netdev, therbert, ncardwell, maze,
	Yuchung Cheng

On Tue, 24 Apr 2012, Eric Dumazet wrote:

> On Mon, 2012-04-23 at 22:37 +0200, Eric Dumazet wrote:
> 
> > We could try to coalesce ACKs before backlogging them. I'll work on
> > this.
> 
> I did an experiment, and found that basic coalescing was not working in
> the case of packet loss and a SACK storm.

...That case might also have some performance issues at the receiver end 
when a hole is filled and TCP pushes stuff higher up.

> Doing smart coalescing in this case sounds really complex.

Why's that? ...We'd compare options 32 bits at a time (like you already 
do anyway) and if we find a difference we check the preceding bits to 
validate that it's a SACK option (the changing one should be in the first 
start-end pair). ...As long as there isn't a hole in every other segment 
we'd be winners, I think.

> Should we really continue this way?

Why not, but wouldn't it be nicer to coalesce them already in GRO below 
with an assumption that GRO is likely to find some "mss" equivalent 
which tells the gap between consecutive ACK (or even SACK) seqnos?

I've long been thinking that it would be nice to run offloading for ACKs 
too, and possibly even for SACKs, at both ends, although that might not 
be possible with anything other than GSO/GRO, at least at the moment.


-- 
 i.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:01           ` Ilpo Järvinen
@ 2012-04-24  8:10             ` David Miller
  2012-04-24  8:21               ` Ilpo Järvinen
  2012-04-24  8:49             ` [PATCH 2/2 net-next] tcp: sk_add_backlog() is too agressive for TCP Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-24  8:10 UTC (permalink / raw)
  To: ilpo.jarvinen
  Cc: eric.dumazet, rick.jones2, netdev, therbert, ncardwell, maze, ycheng

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 24 Apr 2012 11:01:37 +0300 (EEST)

> Why not, but wouldn't it be nicer to coalesce them already in GRO below 
> with an assumption that GRO is likely to find some "mss" equivalent 
> which tells the gap between consecutive ACK (or even SACK) seqnos?

GRO must be able to precisely reproduce the input stream if the packet
is forwarded and therefore we end up resegmenting on output with GSO.
That makes this a non-starter since we must therefore remember all of
the SACK boundaries in the original packets.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:10             ` David Miller
@ 2012-04-24  8:21               ` Ilpo Järvinen
  2012-04-24  8:25                 ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Ilpo Järvinen @ 2012-04-24  8:21 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, rick.jones2, Netdev, therbert, ncardwell, maze,
	Yuchung Cheng

On Tue, 24 Apr 2012, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 24 Apr 2012 11:01:37 +0300 (EEST)
> 
> > Why not, but wouldn't it be nicer to coalesce them already in GRO below 
> > with an assumption that GRO is likely to find some "mss" equivalent 
> > which tells the gap between consecutive ACK (or even SACK) seqnos?
> 
> GRO must be able to precisely reproduce the input stream if the packet
> is forwarded and therefore we end up resegmenting on output with GSO.

I'm aware of this.

> That makes this a non-starter since we must therefore remember all of
> the SACK boundaries in the original packets.

GRO works because TCP tends to use a rather constant MSS, right? ...Since 
ACKs and SACKs are nothing more than a reflection of the MSS boundaries of 
the opposite direction, I don't find that as impossible as you do, because 
the same kind of "mss" assumption can be applied there. But GRO has made 
this somewhat messier now because the receiver no longer generates an 
ACK per MSS or per 2*MSS, but that could be "fixed" by offloading the 
ACK sending when responding to a GROed packet.


-- 
 i.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:21               ` Ilpo Järvinen
@ 2012-04-24  8:25                 ` David Miller
  2012-04-24  8:40                   ` Ilpo Järvinen
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-24  8:25 UTC (permalink / raw)
  To: ilpo.jarvinen
  Cc: eric.dumazet, rick.jones2, netdev, therbert, ncardwell, maze, ycheng

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 24 Apr 2012 11:21:18 +0300 (EEST)

> On Tue, 24 Apr 2012, David Miller wrote:
> 
>> That makes this a non-starter since we must therefore remember all of
>> the SACK boundaries in the original packets.
> 
> GRO works because TCP tends to use a rather constant MSS, right? ...Since 
> ACKs and SACKs are nothing more than a reflection of the MSS boundaries of 
> the opposite direction, I don't find that as impossible as you do, because 
> the same kind of "mss" assumption can be applied there. But GRO has made 
> this somewhat messier now because the receiver no longer generates an 
> ACK per MSS or per 2*MSS, but that could be "fixed" by offloading the 
> ACK sending when responding to a GROed packet.

We're talking about accumulating ACKs in GRO, not data packets.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:25                 ` David Miller
@ 2012-04-24  8:40                   ` Ilpo Järvinen
  2012-04-24  8:48                     ` David Miller
  2012-04-24  8:56                     ` Eric Dumazet
  0 siblings, 2 replies; 37+ messages in thread
From: Ilpo Järvinen @ 2012-04-24  8:40 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, rick.jones2, Netdev, therbert, ncardwell, maze,
	Yuchung Cheng

On Tue, 24 Apr 2012, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 24 Apr 2012 11:21:18 +0300 (EEST)
> 
> > On Tue, 24 Apr 2012, David Miller wrote:
> > 
> >> That makes this a non-starter since we must therefore remember all of
> >> the SACK boundaries in the original packets.
> > 
> > GRO works because TCP tends to use a rather constant MSS, right? ...Since 
> > ACKs and SACKs are nothing more than a reflection of the MSS boundaries of 
> > the opposite direction, I don't find that as impossible as you do, because 
> > the same kind of "mss" assumption can be applied there. But GRO has made 
> > this somewhat messier now because the receiver no longer generates an 
> > ACK per MSS or per 2*MSS, but that could be "fixed" by offloading the 
> > ACK sending when responding to a GROed packet.
> 
> We're talking about accumulating ACKs in GRO, not data packets.

So am I... :-)  ...Code speaks louder than a thousand words:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8bb6ade..33b87b2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2820,7 +2820,11 @@ found:
 	flush |= (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
-	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
+
+	ackgap = skb_shinfo(p)->ack_size;
+	ackdiff = th2->ack_seq - th->ack_seq;
+	flush |= (ackdiff - 1) >= ackgap;
+
 	for (i = sizeof(*th); i < thlen; i += 4)
 		flush |= *(u32 *)((u8 *)th + i) ^
 			 *(u32 *)((u8 *)th2 + i);


...Obviously data and ACKs couldn't practically be GROed at the same time 
(which would allow reusing the gso_size field for ack_size). ...But why 
exactly do you think this is not a possible or viable solution if fully 
implemented?

And the problem I mentioned in the previous mail (in terms of this code 
fragment) is that ackdiff is no longer MSS or 2*MSS, because GRO for the 
opposite direction doesn't trigger all the ACKs a non-GRO receiver would.

-- 
 i.

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* RE: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 20:37       ` Eric Dumazet
                           ` (2 preceding siblings ...)
  2012-04-24  2:20         ` Eric Dumazet
@ 2012-04-24  8:44         ` David Laight
  2012-04-24  8:53           ` Eric Dumazet
  3 siblings, 1 reply; 37+ messages in thread
From: David Laight @ 2012-04-24  8:44 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: rick.jones2, netdev, therbert, ncardwell, maze, ycheng, ilpo.jarvinen

 
> On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:
> 
> > Hmmm... why don't we just acknowledge reality and special case ACKs?
> > 
> 
> Yes, why not.
> 
> 
> > If a TCP packet is dataless we should just let it go 
> > through no matter what and with no limits.
> >  It is by definition transient and will not
> > get queued up into the socket past this backlog stage.
> > 
> 
> > Even though they are transient, we need a limit. Without copybreak, 
> > an ACK can cost 2048+256 bytes.
> 
> In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> backlogs of 300 ACK packets...

What about forcing a copybreak for ACKs when above the rx buffer size?
That way you avoid the cost of the copy in the normal case when
the data will be freed, but avoid the memory overhead when a lot
of ACKs (or rx data) is queued.

	David

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:40                   ` Ilpo Järvinen
@ 2012-04-24  8:48                     ` David Miller
  2012-04-24 10:32                       ` Ilpo Järvinen
  2012-04-24  8:56                     ` Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-24  8:48 UTC (permalink / raw)
  To: ilpo.jarvinen
  Cc: eric.dumazet, rick.jones2, netdev, therbert, ncardwell, maze, ycheng

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 24 Apr 2012 11:40:27 +0300 (EEST)

> And the problem I mentioned in the previous mail (in terms of this code 
> fragment) is that ackdiff is no longer MSS or 2*MSS, because GRO for the 
> opposite direction doesn't trigger all the ACKs a non-GRO receiver would.

This is the ACK counting Eric mentioned, which we need to preserve
for the purposes of CWND growth.

And in any event, you cannot depend upon the ACK boundaries in any way,
shape, or form; 2 * MSS is just one valid mode of ACKing.

Whether we get a GRO stretch ACK or a normal 2 * MSS one, it has to
work properly either way.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:01           ` Ilpo Järvinen
  2012-04-24  8:10             ` David Miller
@ 2012-04-24  8:49             ` Eric Dumazet
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-04-24  8:49 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, rick.jones2, Netdev, therbert, ncardwell, maze,
	Yuchung Cheng

On Tue, 2012-04-24 at 11:01 +0300, Ilpo Järvinen wrote:
> On Tue, 24 Apr 2012, Eric Dumazet wrote:
> 
> > On Mon, 2012-04-23 at 22:37 +0200, Eric Dumazet wrote:
> > 
> > > We could try to coalesce ACKs before backlogging them. I'll work on
> > > this.
> > 
> > I did an experiment, and found that basic coalescing was not working in
> > the case of packet loss and a SACK storm.
> 
> ...That case might also have some performance issues at the receiver end 
> when a hole is filled and TCP pushes stuff higher up.
> 
> > Doing smart coalescing in this case sounds really complex.
> 
> Why's that? ...We'd compare options 32 bits at a time (like you already 
> do anyway) and if we find a difference we check the preceding bits to 
> validate that it's a SACK option (the changing one should be in the first 
> start-end pair). ...As long as there isn't a hole in every other segment 
> we'd be winners, I think.
> 
> > Should we really continue this way?
> 
> Why not, but wouldn't it be nicer to coalesce them already in GRO below 
> with an assumption that GRO is likely to find some "mss" equivalent 
> which tells the gap between consecutive ACK (or even SACK) seqnos?
> 
> I've long been thinking that it would be nice to run offloading for ACKs 
> too, and possibly even for SACKs, at both ends, although that might not 
> be possible with anything other than GSO/GRO, at least at the moment.
> 
> 


GRO doesn't coalesce pure ACKs; that's part of the GRO contract.

By the way, I find GRO less and less attractive: if we have fragged skbs
provided by drivers, we can do the GRO almost for free in the TCP stack,
instead of with very complex/duplicated logic before the TCP stack.

http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commit;h=1402d366019fedaa2b024f2bac06b7cc9a8782e1

Doing this coalescing in the TCP stack solves many problems GRO is
unable to address, like reordering...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:44         ` David Laight
@ 2012-04-24  8:53           ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-04-24  8:53 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, rick.jones2, netdev, therbert, ncardwell, maze,
	ycheng, ilpo.jarvinen

On Tue, 2012-04-24 at 09:44 +0100, David Laight wrote:
> > On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:
> > 
> > > Hmmm... why don't we just acknowledge reality and special case ACKs?
> > > 
> > 
> > Yes, why not.
> > 
> > 
> > > If a TCP packet is dataless we should just let it go 
> > > through no matter what and with no limits.
> > >  It is by definition transient and will not
> > > get queued up into the socket past this backlog stage.
> > > 
> > 
> > Even though they are transient, we need a limit. Without copybreak, 
> > an ACK can cost 2048+256 bytes.
> > 
> > In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> > backlogs of 300 ACK packets...
> 
> What about forcing a copybreak for ACKs when above the rx buffer size?
> That way you avoid the cost of the copy in the normal case when
> the data will be freed, but avoid the memory overhead when a lot
> of ACKs (or rx data) is queued.

That's noise, as the minimal truesize of an ACK packet is 512 + 256 on
x86_64.

The fact that ixgbe provides 1024 + 256 could be fixed in the driver;
it's a 4-line change actually. Then you already have a minimal skb.
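
(The 256 in these figures is roughly sizeof(struct sk_buff) on x86_64,
which truesize counts on top of the data buffer itself.)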

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:40                   ` Ilpo Järvinen
  2012-04-24  8:48                     ` David Miller
@ 2012-04-24  8:56                     ` Eric Dumazet
  2012-04-26  8:10                       ` [RFC] allow skb->head to point/alias to first skb frag Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-24  8:56 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, rick.jones2, Netdev, therbert, ncardwell, maze,
	Yuchung Cheng

On Tue, 2012-04-24 at 11:40 +0300, Ilpo Järvinen wrote:

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 8bb6ade..33b87b2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2820,7 +2820,11 @@ found:
>  	flush |= (__force int)(flags & TCP_FLAG_CWR);
>  	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
>  		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> -	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> +
> +	ackgap = skb_shinfo(p)->ack_size;
> +	ackdiff = th2->ack_seq - th->ack_seq;
> +	flush |= (ackdiff - 1) >= ackgap;
> +
>  	for (i = sizeof(*th); i < thlen; i += 4)
>  		flush |= *(u32 *)((u8 *)th + i) ^
>  			 *(u32 *)((u8 *)th2 + i);
> 

See how we duplicate tcp stack in GRO layer ;)

Really I have many doubts about GRO today.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-24  8:48                     ` David Miller
@ 2012-04-24 10:32                       ` Ilpo Järvinen
  0 siblings, 0 replies; 37+ messages in thread
From: Ilpo Järvinen @ 2012-04-24 10:32 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, rick.jones2, Netdev, therbert, ncardwell, maze,
	Yuchung Cheng

On Tue, 24 Apr 2012, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 24 Apr 2012 11:40:27 +0300 (EEST)
> 
> > And the problem I mentioned in the previous mail (in terms of this code 
> > fragment) is that ackdiff is no longer MSS or 2*MSS, because GRO for the 
> > opposite direction doesn't trigger all the ACKs a non-GRO receiver would.
> 
> And in any event, you cannot depend upon the ACK boundaries in any way,
> shape, or form; 2 * MSS is just one valid mode of ACKing.
> 
> Whether we get a GRO stretch ACK or a normal 2 * MSS one, it has to
> work properly either way.

I don't see how any of this is that much different from the data side. x 
bytes, x bytes, x bytes, ... (typically x=MSS) is just one valid mode of 
data sending, and yet GRO is currently based on this one mode only. 
...It of course works properly also when that is not true, but it won't 
work as efficiently as when the size remains the same.


-- 
 i.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2 net-next] tcp: sk_add_backlog() is too aggressive for TCP
  2012-04-23 22:16                 ` Rick Jones
@ 2012-04-24 15:25                   ` Christoph Lameter
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Lameter @ 2012-04-24 15:25 UTC (permalink / raw)
  To: Rick Jones
  Cc: Eric Dumazet, David Miller, netdev, therbert, ncardwell, maze,
	ycheng, ilpo.jarvinen

On Mon, 23 Apr 2012, Rick Jones wrote:

> Is it at all possible to have the copies happen without the connection being
> locked?  If indeed it is possible to be held off with the connection locked
> for the better part of 3/4 of a millisecond, just what will that do to 40 or
> 100 GbE?  If you've been seeing queues of 300 ACKs at 10 GbE that would be
> 3000 at 100 GbE, and assuming those are all in a 2048-byte buffer that's 6MB
> just of ACKs.  I suppose 100GbE does mean non-trivial quantities of buffering
> anyway, but that does still seem rather high.

At some point people will need to realize that it is not business as usual
if one tries to use the current network protocols at speeds above 1G.

There is a reason high-speed networks implement new protocols like
RDMA techniques and lossless network characteristics.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC] allow skb->head to point/alias to first skb frag
  2012-04-24  8:56                     ` Eric Dumazet
@ 2012-04-26  8:10                       ` Eric Dumazet
  2012-04-26  8:36                         ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-26  8:10 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, rick.jones2, Netdev, therbert, ncardwell, maze,
	Yuchung Cheng

On Tue, 2012-04-24 at 10:56 +0200, Eric Dumazet wrote:

> Really I have many doubts about GRO today.

Particularly if a NIC driver provides linear skbs: the resulting GRO skb
is a list of skbs, with no memory savings and many cache-line misses when
copying to userland.

Maybe we could have a new kind of skb head allocation/setup, pointing to
a frag of 2048 bytes instead of a kmalloc() blob.

Right now, a fragged skb uses 3 blocks of memory:

1) struct sk_buff
2) a block of 512 or 1024 or 2048 bytes of memory (skb->head)
3) a frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)

While a linear skb has:

1) struct sk_buff
2) a block of 512 or 1024 or 2048 bytes of memory (skb->head) from
kmalloc()


The idea would have:

1) struct sk_buff
2) skb->head points to a frag (aliasing, no memory allocation)
3) a frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)

Or the reverse (no frag, so that the skb is considered linear), but with
special code to 'allow' this skb head to be considered as a frag when
needed (splice() code, GRO merge, or TCP coalescing).


That would make GRO (and TCP coalescing) much more efficient, since the
resulting aggregated skb would be:
1) struct sk_buff
2) skb->head points to 1st frag (aliasing, no memory allocation)
3) array of [1..16] frags

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26  8:10                       ` [RFC] allow skb->head to point/alias to first skb frag Eric Dumazet
@ 2012-04-26  8:36                         ` David Miller
  2012-04-26  9:10                           ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-26  8:36 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ilpo.jarvinen, rick.jones2, netdev, therbert, ncardwell, maze, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Apr 2012 10:10:54 +0200

> The idea would have:
> 
> 1) struct sk_buff
> 2) skb->head points to a frag (aliasing, no memory allocation)
> 3) a frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)

What would you set skb->data_len and skb->len to?

How would you handle tailroom?  We have the rule that you can't append
to the tail of a fragmented SKB (I mean append, the way that IPSEC
wants to write data to the end of an SKB when encrypting, for
example).  This is only allowed on fully linear SKBs.

So, for this reason and others, you can't pretend that this new
construction is linear in any way.  It would break a bunch of things.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26  8:36                         ` David Miller
@ 2012-04-26  9:10                           ` Eric Dumazet
  2012-04-26  9:18                             ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-26  9:10 UTC (permalink / raw)
  To: David Miller
  Cc: ilpo.jarvinen, rick.jones2, netdev, therbert, ncardwell, maze, ycheng

On Thu, 2012-04-26 at 04:36 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Apr 2012 10:10:54 +0200
> 
> > The idea would have:
> > 
> > 1) struct sk_buff
> > 2) skb->head points to a frag (aliasing, no memory allocation)
> > 3) a frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)
> 
> What would you set skb->data_len and skb->len to?
> 
> How would you handle tailroom?  We have the rule that you can't append
> to the tail of a fragmented SKB (I mean append, the way that IPSEC
> wants to write data to the end of an SKB when encrypting, for
> example).  This is only allowed on fully linear SKBs.
> 
> So, for this reason and others, you can't pretend that this new
> construction is linear in any way.  It would break a bunch of things.


The 'frag' would have a known size: 2048 bytes.

But the end of it would be used by struct skb_shared_info,

so data_len would be 0 in fact.

This would look like a regular linear skb.

Just a bit set in the skb to say: warning, skb->head was not
kmalloc()ed; replace kfree(head) with put_page(...).

And this bit would be tested in GRO or TCP merge to 'upgrade' this
skb->head to a proper page/frag.
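
In code, the free path would look roughly like this (a sketch assuming a
head_frag-style bit as described above; names are illustrative):

static void skb_free_head(struct sk_buff *skb)
{
	if (skb->head_frag)	/* head aliases a page frag */
		put_page(virt_to_head_page(skb->head));
	else			/* regular kmalloc()ed head */
		kfree(skb->head);
}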

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26  9:10                           ` Eric Dumazet
@ 2012-04-26  9:18                             ` David Miller
  2012-04-26  9:22                               ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-26  9:18 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ilpo.jarvinen, rick.jones2, netdev, therbert, ncardwell, maze, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Apr 2012 11:10:02 +0200

> The 'frag' would have a known size: 2048 bytes.
> 
> But the end of it would be used by struct skb_shared_info,
> 
> so data_len would be 0 in fact.
> 
> This would look like a regular linear skb.
> 
> Just a bit set in the skb to say: warning, skb->head was not
> kmalloc()ed; replace kfree(head) with put_page(...).
> 
> And this bit would be tested in GRO or TCP merge to 'upgrade' this
> skb->head to a proper page/frag.

And what happens if this ends up in a piece of code which wants to
append a page frag?

Or a piece of code which copies an SKB, including page frag parts?

All I'm saying is that the number of tests necessary to make this work
properly might become prohibitive.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26  9:18                             ` David Miller
@ 2012-04-26  9:22                               ` Eric Dumazet
  2012-04-26 10:09                                 ` Maciej Żenczykowski
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-26  9:22 UTC (permalink / raw)
  To: David Miller
  Cc: ilpo.jarvinen, rick.jones2, netdev, therbert, ncardwell, maze, ycheng

On Thu, 2012-04-26 at 05:18 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Apr 2012 11:10:02 +0200
> 
> > The 'frag' would have a known size: 2048 bytes.
> > 
> > But the end of it would be used by struct skb_shared_info,
> > 
> > so data_len would be 0 in fact.
> > 
> > This would look like a regular linear skb.
> > 
> > Just a bit set in the skb to say: warning, skb->head was not
> > kmalloc()ed; replace kfree(head) with put_page(...).
> > 
> > And this bit would be tested in GRO or TCP merge to 'upgrade' this
> > skb->head to a proper page/frag.
> 
> And what happens if this ends up in a piece of code which wants to
> append a page frag?
> 
> Or a piece of code which copies an SKB, including page frag parts?
> 
> All I'm saying is that the number of tests necessary to make this work
> properly might become prohibitive.

I'll cook a patch; I believe not that many parts assume skb->head can
be kmalloc()ed/kfree()d. This should be contained in net/core/skbuff.c
only.

Elsewhere we use skb->head as a pointer to memory. It will stay the
same.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26  9:22                               ` Eric Dumazet
@ 2012-04-26 10:09                                 ` Maciej Żenczykowski
  2012-04-26 12:32                                   ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Maciej Żenczykowski @ 2012-04-26 10:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ilpo.jarvinen, rick.jones2, netdev, therbert,
	ncardwell, ycheng

It would be very useful if there were an API for freeing skb->head memory...

I've fooled around with a single memory buffer for both the head and
the data and seen huge performance wins under heavy packet load (half
the number of malloc/free calls), but could never quite get it to be
100% stable.

On Thu, Apr 26, 2012 at 2:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-04-26 at 05:18 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Thu, 26 Apr 2012 11:10:02 +0200
>>
>> > The 'frag' would have a known size: 2048 bytes.
>> >
>> > But the end of it would be used by struct skb_shared_info,
>> >
>> > so data_len would be 0 in fact.
>> >
>> > This would look like a regular linear skb.
>> >
>> > Just a bit set in the skb to say: warning, skb->head was not
>> > kmalloc()ed; replace kfree(head) with put_page(...).
>> >
>> > And this bit would be tested in GRO or TCP merge to 'upgrade' this
>> > skb->head to a proper page/frag.
>>
>> And what happens if this ends up in a piece of code which wants to
>> append a page frag?
>>
>> Or a piece of code which copies an SKB, including page frag parts?
>>
>> All I'm saying is that the number of tests necessary to make this work
>> properly might become prohibitive.
>
> I'll cook a patch; I believe not many parts assume skb->head can be
> kmalloc()ed and kfree()d. This should be contained in net/core/skbuff.c only.
>
> Elsewhere we use skb->head as a pointer to memory. It will stay the
> same.
>
>
>



-- 
Maciej A. Żenczykowski
Kernel Networking Developer @ Google
1600 Amphitheatre Parkway, Mountain View, CA 94043
tel: +1 (650) 253-0062

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26 10:09                                 ` Maciej Żenczykowski
@ 2012-04-26 12:32                                   ` Eric Dumazet
  2012-04-26 13:50                                     ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-26 12:32 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: David Miller, ilpo.jarvinen, rick.jones2, netdev, therbert,
	ncardwell, ycheng

On Thu, 2012-04-26 at 03:09 -0700, Maciej Żenczykowski wrote:
> It would be very useful if there were an API for freeing skb->head memory...
> 
> I've fooled around with a single memory buffer for both the head and
> the data and seen huge performance wins under heavy packet load (half
> the amount of malloc/free), but could never quite get it to be 100%
> stable.

My patch is almost ready, and seems very good.

I added a param to build_skb() to tell it that the memory comes from a (sub)page

tg3 uses the new thing.

GRO does the correct merging.

(by the way, I noticed GRO lies about skb truesize... since it
accumulates frag lengths instead of allocated space... with a 1448/2048
mismatch for MTU=1500)
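
(Back-of-envelope: each merged MTU=1500 segment contributes len = 1448
bytes to truesize but actually pins a 2048-byte fragment, so GRO
under-reports real memory use by 600 bytes, roughly 30%, per segment.)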

And this apparently solves my slub performance issue I had on my dual
quad core machine, since I no longer hit slub slow path.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26 12:32                                   ` Eric Dumazet
@ 2012-04-26 13:50                                     ` Eric Dumazet
  2012-04-26 20:12                                       ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-04-26 13:50 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: David Miller, ilpo.jarvinen, rick.jones2, netdev, therbert,
	ncardwell, ycheng

On Thu, 2012-04-26 at 14:32 +0200, Eric Dumazet wrote:
> On Thu, 2012-04-26 at 03:09 -0700, Maciej Żenczykowski wrote:
> > It would be very useful if there were an API for freeing skb->head memory...
> > 
> > I've fooled around with a single memory buffer for both the head and
> > the data and seen huge performance wins under heavy packet load (half
> > the amount of malloc/free), but could never quite get it to be 100%
> > stable.
> 
> My patch is almost ready, and seems very good.
> 
> I added a param to build_skb() to tell it that the memory comes from a (sub)page
> 
> tg3 uses the new thing.
> 
> GRO does the correct merging.
> 
> (by the way, I noticed GRO lies about skb truesize... since it
> accumulates frag lengths instead of allocated space... with a 1448/2048
> mismatch for MTU=1500)
> 
> And this apparently solves my slub performance issue I had on my dual
> quad core machine, since I no longer hit slub slow path.
> 
> 

Here is the raw patch. (tg3 part is dirty and needs some helpers)

I am working on the TCP coalesce part and the splice() helpers (to avoid a
copy), and on splitting this into five patches for further
discussion/inclusion.


 drivers/net/ethernet/broadcom/bnx2.c            |    2 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |    4 
 drivers/net/ethernet/broadcom/tg3.c             |   61 +++++++++++---
 drivers/net/ethernet/broadcom/tg3.h             |    2 
 include/linux/skbuff.h                          |    7 +
 net/core/dev.c                                  |    5 -
 net/core/skbuff.c                               |   46 ++++++++--
 net/ipv4/tcp_input.c                            |    1 
 8 files changed, 103 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index ab55979..ac7b744 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -3006,7 +3006,7 @@ error:
 
 	dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
 			 PCI_DMA_FROMDEVICE);
-	skb = build_skb(data);
+	skb = build_skb(data, 0);
 	if (!skb) {
 		kfree(data);
 		goto error;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 60d5b54..9c5bc5d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -513,7 +513,7 @@ static inline void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 	dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
 			 fp->rx_buf_size, DMA_FROM_DEVICE);
 	if (likely(new_data))
-		skb = build_skb(data);
+		skb = build_skb(data, 0);
 
 	if (likely(skb)) {
 #ifdef BNX2X_STOP_ON_ERROR
@@ -721,7 +721,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 						 dma_unmap_addr(rx_buf, mapping),
 						 fp->rx_buf_size,
 						 DMA_FROM_DEVICE);
-				skb = build_skb(data);
+				skb = build_skb(data, 0);
 				if (unlikely(!skb)) {
 					kfree(data);
 					fp->eth_q_stats.rx_skb_alloc_failed++;
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 0c3e7c7..d216076 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -5619,12 +5619,18 @@ static void tg3_tx(struct tg3_napi *tnapi)
 
 static void tg3_rx_data_free(struct tg3 *tp, struct ring_info *ri, u32 map_sz)
 {
+	unsigned int skb_size = SKB_DATA_ALIGN(map_sz + TG3_RX_OFFSET(tp)) +
+		   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
 	if (!ri->data)
 		return;
 
 	pci_unmap_single(tp->pdev, dma_unmap_addr(ri, mapping),
 			 map_sz, PCI_DMA_FROMDEVICE);
-	kfree(ri->data);
+	if (skb_size <= 2048)
+		put_page(virt_to_head_page(ri->data));
+	else
+		kfree(ri->data);
 	ri->data = NULL;
 }
 
@@ -5640,7 +5646,8 @@ static void tg3_rx_data_free(struct tg3 *tp, struct ring_info *ri, u32 map_sz)
  * (to fetch the error flags, vlan tag, checksum, and opaque cookie).
  */
 static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
-			    u32 opaque_key, u32 dest_idx_unmasked)
+			     u32 opaque_key, u32 dest_idx_unmasked,
+			     unsigned int *frag_size)
 {
 	struct tg3_rx_buffer_desc *desc;
 	struct ring_info *map;
@@ -5675,16 +5682,36 @@ static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
 	 */
 	skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp)) +
 		   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	data = kmalloc(skb_size, GFP_ATOMIC);
-	if (!data)
-		return -ENOMEM;
-
+	if (skb_size <= 2048) {
+		if (tpr->rx_page_size < 2048) {
+			struct page *page = alloc_page(GFP_ATOMIC);
+			if (!page)
+				return -ENOMEM;
+			atomic_add((PAGE_SIZE / 2048) - 1, &page->_count); 
+			tpr->rx_page_addr = page_address(page);
+			tpr->rx_page_size = PAGE_SIZE;
+		}
+		data = tpr->rx_page_addr;
+		tpr->rx_page_addr += 2048;
+		tpr->rx_page_size -= 2048;
+		*frag_size = 2048;
+	} else {
+		data = kmalloc(skb_size, GFP_ATOMIC);
+		if (!data)
+			return -ENOMEM;
+		*frag_size = 0;
+	}
 	mapping = pci_map_single(tp->pdev,
 				 data + TG3_RX_OFFSET(tp),
 				 data_size,
 				 PCI_DMA_FROMDEVICE);
 	if (pci_dma_mapping_error(tp->pdev, mapping)) {
-		kfree(data);
+		if (skb_size <= 2048) {
+			tpr->rx_page_addr -= 2048;
+			tpr->rx_page_size += 2048;
+		} else {
+			kfree(data);
+		}
 		return -EIO;
 	}
 
@@ -5835,18 +5862,22 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
 
 		if (len > TG3_RX_COPY_THRESH(tp)) {
 			int skb_size;
+			unsigned int frag_size;
 
 			skb_size = tg3_alloc_rx_data(tp, tpr, opaque_key,
-						    *post_ptr);
+						    *post_ptr, &frag_size);
 			if (skb_size < 0)
 				goto drop_it;
 
 			pci_unmap_single(tp->pdev, dma_addr, skb_size,
 					 PCI_DMA_FROMDEVICE);
 
-			skb = build_skb(data);
+			skb = build_skb(data, frag_size);
 			if (!skb) {
-				kfree(data);
+				if (frag_size)
+					put_page(virt_to_head_page(data));
+				else
+					kfree(data);
 				goto drop_it_no_recycle;
 			}
 			skb_reserve(skb, TG3_RX_OFFSET(tp));
@@ -7279,7 +7310,10 @@ static int tg3_rx_prodring_alloc(struct tg3 *tp,
 
 	/* Now allocate fresh SKBs for each rx ring. */
 	for (i = 0; i < tp->rx_pending; i++) {
-		if (tg3_alloc_rx_data(tp, tpr, RXD_OPAQUE_RING_STD, i) < 0) {
+		unsigned int frag_size;
+
+		if (tg3_alloc_rx_data(tp, tpr, RXD_OPAQUE_RING_STD, i,
+				      &frag_size) < 0) {
 			netdev_warn(tp->dev,
 				    "Using a smaller RX standard ring. Only "
 				    "%d out of %d buffers were allocated "
@@ -7311,7 +7345,10 @@ static int tg3_rx_prodring_alloc(struct tg3 *tp,
 	}
 
 	for (i = 0; i < tp->rx_jumbo_pending; i++) {
-		if (tg3_alloc_rx_data(tp, tpr, RXD_OPAQUE_RING_JUMBO, i) < 0) {
+		unsigned int frag_size;
+
+		if (tg3_alloc_rx_data(tp, tpr, RXD_OPAQUE_RING_JUMBO, i,
+				      &frag_size) < 0) {
 			netdev_warn(tp->dev,
 				    "Using a smaller RX jumbo ring. Only %d "
 				    "out of %d buffers were allocated "
diff --git a/drivers/net/ethernet/broadcom/tg3.h b/drivers/net/ethernet/broadcom/tg3.h
index 93865f8..7c85545 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -2815,6 +2815,8 @@ struct tg3_rx_prodring_set {
 	struct ring_info		*rx_jmb_buffers;
 	dma_addr_t			rx_std_mapping;
 	dma_addr_t			rx_jmb_mapping;
+	void				*rx_page_addr;
+	unsigned int			rx_page_size;
 };
 
 #define TG3_IRQ_MAX_VECS_RSS		5
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4a656b5..1161111 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -470,7 +470,8 @@ struct sk_buff {
 	__u8			wifi_acked_valid:1;
 	__u8			wifi_acked:1;
 	__u8			no_fcs:1;
-	/* 9/11 bit hole (depending on ndisc_nodetype presence) */
+	__u8			head_frag:1;
+	/* 8/10 bit hole (depending on ndisc_nodetype presence) */
 	kmemcheck_bitfield_end(flags2);
 
 #ifdef CONFIG_NET_DMA
@@ -558,11 +559,13 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 }
 
 extern void kfree_skb(struct sk_buff *skb);
+extern struct kmem_cache *skbuff_head_cache;
+
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
-extern struct sk_buff *build_skb(void *data);
+extern struct sk_buff *build_skb(void *data, unsigned int frag_size);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 501f3cc..9537321 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3546,7 +3546,10 @@ gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 		break;
 
 	case GRO_MERGED_FREE:
-		consume_skb(skb);
+		if (skb->head_frag)
+			kmem_cache_free(skbuff_head_cache, skb);
+		else
+			consume_skb(skb);
 		break;
 
 	case GRO_HELD:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2342a72..d8fb4b2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,7 +69,7 @@
 #include <trace/events/skb.h>
 #include <linux/highmem.h>
 
-static struct kmem_cache *skbuff_head_cache __read_mostly;
+struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
@@ -245,6 +245,7 @@ EXPORT_SYMBOL(__alloc_skb);
 /**
  * build_skb - build a network buffer
  * @data: data buffer provided by caller
+ * @frag_size: size of fragment, or 0 if head was kmalloced
  *
  * Allocate a new &sk_buff. Caller provides space holding head and
  * skb_shared_info. @data must have been allocated by kmalloc()
@@ -258,20 +259,21 @@ EXPORT_SYMBOL(__alloc_skb);
  *  before giving packet to stack.
  *  RX rings only contains data buffers, not full skbs.
  */
-struct sk_buff *build_skb(void *data)
+struct sk_buff *build_skb(void *data, unsigned int frag_size)
 {
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
-	unsigned int size;
+	unsigned int size = frag_size ? : ksize(data);
 
 	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
 	if (!skb)
 		return NULL;
 
-	size = ksize(data) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->truesize = SKB_TRUESIZE(size);
+	skb->head_frag = frag_size != 0;
 	atomic_set(&skb->users, 1);
 	skb->head = data;
 	skb->data = data;
@@ -376,6 +378,14 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 		skb_get(list);
 }
 
+static void skb_free_head(struct sk_buff *skb)
+{
+	if (skb->head_frag)
+		put_page(virt_to_head_page(skb->head));
+	else
+		kfree(skb->head);
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
@@ -402,7 +412,7 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		skb_free_head(skb);
 	}
 }
 
@@ -644,6 +654,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	C(tail);
 	C(end);
 	C(head);
+	C(head_frag);
 	C(data);
 	C(truesize);
 	atomic_set(&n->users, 1);
@@ -940,7 +951,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
-	if (fastpath &&
+	if (fastpath && !skb->head_frag &&
 	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
 		memmove(skb->head + size, skb_shinfo(skb),
 			offsetof(struct skb_shared_info,
@@ -967,7 +978,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
 
 	if (fastpath) {
-		kfree(skb->head);
+		skb_free_head(skb);
 	} else {
 		/* copy this zero copy skb frags */
 		if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
@@ -985,6 +996,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	off = (data + nhead) - skb->head;
 
 	skb->head     = data;
+	skb->head_frag = 0;
 adjust_others:
 	skb->data    += off;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
@@ -2889,6 +2901,26 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 
 		NAPI_GRO_CB(skb)->free = 1;
 		goto done;
+	} else if (skb->head_frag) {
+		int nr_frags = pinfo->nr_frags;
+		skb_frag_t *frag = pinfo->frags + nr_frags;
+		struct page *page = virt_to_head_page(skb->head);
+
+
+		if (nr_frags >= MAX_SKB_FRAGS)
+			return -E2BIG;
+		offset -= headlen;
+		offset += skb->head - (unsigned char *)page_address(page);
+
+		pinfo->nr_frags = nr_frags + 1;
+
+		frag->page.p	  = page;
+		frag->page_offset = offset;
+		skb_frag_size_set(frag, len);
+
+		/* note : napi_skb_finish() will free skb, not skb->head */
+		NAPI_GRO_CB(skb)->free = 1;
+		goto done;
 	} else if (skb_gro_len(p) != pinfo->gso_size)
 		return -E2BIG;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c93b0cb..69d379a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4478,6 +4478,7 @@ merge:
 		TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
 		return true;
 	}
+	/* TODO : handle special case where from->head_frag is true */
 	if (skb_headlen(from) == 0 &&
 	    !skb_has_frag_list(to) &&
 	    !skb_has_frag_list(from) &&
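
To summarize the API change for callers: build_skb() now takes the frag
size as a second argument, 0 meaning the head was kmalloc()ed, and a
caller frees a failed head accordingly (condensed from the tg3 hunk
above):

	skb = build_skb(data, frag_size);
	if (!skb) {
		if (frag_size)
			put_page(virt_to_head_page(data));
		else
			kfree(data);
	}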

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26 13:50                                     ` Eric Dumazet
@ 2012-04-26 20:12                                       ` David Miller
  2012-04-26 20:18                                         ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-04-26 20:12 UTC (permalink / raw)
  To: eric.dumazet
  Cc: maze, ilpo.jarvinen, rick.jones2, netdev, therbert, ncardwell, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Apr 2012 15:50:52 +0200

> Here is the raw patch. (tg3 part is dirty and needs some helpers)
> 
> I am working on the TCP coalesce part and the splice() helpers (to avoid a
> copy), and on splitting this into five patches for further
> discussion/inclusion.

Ok, I like it.

I was confused into thinking that you were going to put this head page
into skb_shared_info()'s page vector.  But you're not.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] allow skb->head to point/alias to first skb frag
  2012-04-26 20:12                                       ` David Miller
@ 2012-04-26 20:18                                         ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-04-26 20:18 UTC (permalink / raw)
  To: David Miller
  Cc: maze, ilpo.jarvinen, rick.jones2, netdev, therbert, ncardwell, ycheng

On Thu, 2012-04-26 at 16:12 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Apr 2012 15:50:52 +0200
> 
> > Here is the raw patch. (tg3 part is dirty and needs some helpers)
> > 
> > I am working on the TCP coalesce part and the splice() helpers (to avoid a
> > copy), and on splitting this into five patches for further
> > discussion/inclusion.
> 
> Ok, I like it.
> 
> I was confused into thinking that you were going to put this head page
> into skb_shared_info()'s page vector.  But you're not.
> 

Well, my first mail was confusing... it was just an idea without prior
thought. Now the picture is clear.

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-04-26 20:18 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-23  9:38 [PATCH 2/2 net-next] tcp: sk_add_backlog() is too agressive for TCP Eric Dumazet
2012-04-23 17:14 ` Rick Jones
2012-04-23 17:23   ` Eric Dumazet
2012-04-23 20:01     ` David Miller
2012-04-23 20:37       ` Eric Dumazet
2012-04-23 20:57         ` Rick Jones
2012-04-23 21:30           ` Eric Dumazet
2012-04-23 21:51             ` Rick Jones
2012-04-23 21:56               ` Rick Jones
2012-04-23 22:05               ` Eric Dumazet
2012-04-23 22:16                 ` Rick Jones
2012-04-24 15:25                   ` Christoph Lameter
2012-04-23 21:01         ` David Miller
2012-04-23 21:38           ` Eric Dumazet
2012-04-24  2:20         ` Eric Dumazet
2012-04-24  2:27           ` David Miller
2012-04-24  8:01           ` Ilpo Järvinen
2012-04-24  8:10             ` David Miller
2012-04-24  8:21               ` Ilpo Järvinen
2012-04-24  8:25                 ` David Miller
2012-04-24  8:40                   ` Ilpo Järvinen
2012-04-24  8:48                     ` David Miller
2012-04-24 10:32                       ` Ilpo Järvinen
2012-04-24  8:56                     ` Eric Dumazet
2012-04-26  8:10                       ` [RFC] allow skb->head to point/alias to first skb frag Eric Dumazet
2012-04-26  8:36                         ` David Miller
2012-04-26  9:10                           ` Eric Dumazet
2012-04-26  9:18                             ` David Miller
2012-04-26  9:22                               ` Eric Dumazet
2012-04-26 10:09                                 ` Maciej Żenczykowski
2012-04-26 12:32                                   ` Eric Dumazet
2012-04-26 13:50                                     ` Eric Dumazet
2012-04-26 20:12                                       ` David Miller
2012-04-26 20:18                                         ` Eric Dumazet
2012-04-24  8:49             ` [PATCH 2/2 net-next] tcp: sk_add_backlog() is too agressive for TCP Eric Dumazet
2012-04-24  8:44         ` David Laight
2012-04-24  8:53           ` Eric Dumazet
