netdev.vger.kernel.org archive mirror
* [RFC] tcp: use order-3 pages in tcp_sendmsg()
@ 2012-09-17  7:49 Eric Dumazet
  2012-09-17 16:12 ` David Miller
  2012-11-15  7:52 ` Yan, Zheng 
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-17  7:49 UTC (permalink / raw)
  To: netdev

We currently use a per-socket page reserve for tcp_sendmsg() operations.

It's done to raise the probability of coalescing small write()s into
single segments in the skbs.

But it wastes a lot of memory for applications handling a lot of mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

I did a small experiment using order-3 pages and it gave me a 10%
performance boost, because each TSO skb can use only two frags of 32KB
instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit()
setting up the tx descriptors, and in the TX completion path unmapping
and freeing the frags.

We also spend less time in tcp_sendmsg(), because we call the page
allocator 8x less often.

Now back to the per-socket page: what about trying to factorize it?

Since we can sleep (and/or do a cpu migration) in tcp_sendmsg(), we
can't really use a per-cpu page reserve as we do in
__netdev_alloc_frag().

We could instead use a per-thread reserve, at the cost of adding a test
in the task exit handler.

Recap :

1) Use a per thread page reserve instead of a per socket one
2) Use order-3 pages (or order-0 pages if page size is >= 32768)
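
To make point 2) concrete, here is a rough, illustrative sketch of a
refill helper for such a reserve: try a 32KB compound page first and
fall back to a single page if the high-order allocation fails (on arches
with PAGE_SIZE >= 32768, get_order(32768) is already 0). The struct and
function names are made up for the example, not an existing kernel API:

#define FRAG_RESERVE_ORDER	get_order(32768)

struct frag_reserve {
	struct page	*page;
	unsigned int	offset;
	unsigned int	size;
};

static bool frag_reserve_refill(struct frag_reserve *res, gfp_t gfp)
{
	if (res->page) {
		if (res->offset < res->size)
			return true;		/* room left in current page */
		put_page(res->page);
	}

	/* Try one 32KB compound page, so a 64KB TSO skb needs two frags. */
	res->page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN,
				FRAG_RESERVE_ORDER);
	if (res->page) {
		res->size = PAGE_SIZE << FRAG_RESERVE_ORDER;
	} else {
		/* Fall back to a plain order-0 page. */
		res->page = alloc_page(gfp);
		if (!res->page)
			return false;
		res->size = PAGE_SIZE;
	}
	res->offset = 0;
	return true;
}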

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-17  7:49 [RFC] tcp: use order-3 pages in tcp_sendmsg() Eric Dumazet
@ 2012-09-17 16:12 ` David Miller
  2012-09-17 17:02   ` Eric Dumazet
  2012-11-15  7:52 ` Yan, Zheng 
  1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-17 16:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 17 Sep 2012 09:49:04 +0200

> 2) Use order-3 pages (or order-0 pages if page size is >= 32768)

We could do with an audit to make sure drivers (and the stack in
general) can handle SKB frags of length > PAGE_SIZE.

I have no idea whether such problems actually exist, but I can say it's
a case that doesn't get much testing.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-17 16:12 ` David Miller
@ 2012-09-17 17:02   ` Eric Dumazet
  2012-09-17 17:04     ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-17 17:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Mon, 2012-09-17 at 12:12 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 17 Sep 2012 09:49:04 +0200
> 
> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
> 
> We could do with an audit to make sure drivers (and the stack in
> general) can handle SKB frags of length > PAGE_SIZE.
> 
> I have no idea whether such problems might actually exist, but
> I can say it's a case that gets not so much testing.

I did a (quick) audit and it appears some NICs have limits like 16KB,
but they have helpers to support this, since some arches have
PAGE_SIZE=65536.

ixgbe is an example, although it might need some tweaking if this code
path was not tested.

On the other hand, bnx2x has some special code to linearize overly
fragmented skbs (in bnx2x_pkt_req_lin(), if skb_shinfo(skb)->nr_frags >=
10).

By the way, I did more performance tests, and the speedup is closer to
20%.

A driver already exports dev->gso_max_size and dev->gso_max_segs; I
guess it could export a dev->max_seg_order (defaulting to 0).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-17 17:02   ` Eric Dumazet
@ 2012-09-17 17:04     ` Eric Dumazet
  2012-09-17 17:07       ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-17 17:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:

> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
> it could export a dev->max_seg_order (default to 0)

Oh well, if we use a per-thread order-3 page, a driver won't define an
order but rather the max size of a segment (dev->max_seg_size).
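
As a purely illustrative sketch of that idea (dev->max_seg_size does not
exist in the tree, and the helper below is invented), the stack would
clamp how many bytes it merges into a single frag by the device limit
rather than by a page order:

/* Hypothetical: cap a frag fill by a per-device segment-size limit. */
static unsigned int frag_fill_limit(const struct net_device *dev,
				    unsigned int room_in_reserve)
{
	unsigned int max_seg = dev->max_seg_size ? : room_in_reserve;

	return min(room_in_reserve, max_seg);
}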

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-17 17:04     ` Eric Dumazet
@ 2012-09-17 17:07       ` David Miller
  2012-09-19 15:14         ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-17 17:07 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 17 Sep 2012 19:04:53 +0200

> On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:
> 
>> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
>> it could export a dev->max_seg_order (default to 0)
> 
> Oh well, if we use a per thread order-3 page, a driver wont define an
> order, but the max size of a segment (dev->max_seg_size).

Since you said that your audit showed that most can handle arbitrary
segment sizes, it's better to default to infinity or similar.

Otherwise we'll have to annotate almost every single driver with a
non-zero value; that's not an efficient way to handle this or to deploy
the higher performance quickly.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-17 17:07       ` David Miller
@ 2012-09-19 15:14         ` Eric Dumazet
  2012-09-19 17:28           ` Rick Jones
                             ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-19 15:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Mon, 2012-09-17 at 13:07 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 17 Sep 2012 19:04:53 +0200
> 
> > On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:
> > 
> >> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
> >> it could export a dev->max_seg_order (default to 0)
> > 
> > Oh well, if we use a per thread order-3 page, a driver wont define an
> > order, but the max size of a segment (dev->max_seg_size).
> 
> Since you said that your audit showed that most can handle arbitrary
> segment sizes, it's better to default to infinity or similar.
> 
> Otherwise we'll have to annotate almost every single driver with a
> non-zero value, that's not an efficient way to handle this and
> deploy the higher performance quickly.

I did some tests and have hit no problem so far, even using splice()
[this one was tricky because it only deals with order-0 pages at the
moment].

NICs tested: ixgbe, igb, bnx2x, tg3, Mellanox mlx4.

On loopback, netperf performance goes from 31900 Mb/s to 38500 Mb/s;
that's a 20% increase.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-19 15:14         ` Eric Dumazet
@ 2012-09-19 17:28           ` Rick Jones
  2012-09-19 17:55             ` Eric Dumazet
  2012-09-19 17:56           ` David Miller
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Rick Jones @ 2012-09-19 17:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On 09/19/2012 08:14 AM, Eric Dumazet wrote:
> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
>
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
>
> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> thats a 20 % increase.

I guess Brutus will need a new baseline for his TCP Friends patch then :)

BTW, what is the change, if any for TCP_RR?

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-19 17:28           ` Rick Jones
@ 2012-09-19 17:55             ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-19 17:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Miller, netdev

On Wed, 2012-09-19 at 10:28 -0700, Rick Jones wrote:
> On 09/19/2012 08:14 AM, Eric Dumazet wrote:
> > I did some tests and got no problem so far, even using splice() [ this
> > one was tricky because it only deals with order-0 pages at this moment ]
> >
> > NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
> >
> > On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> > thats a 20 % increase.
> 
> I guess Brutus will need a new baseline for his TCP Friends patch then :)
> 
> BTW, what is the change, if any for TCP_RR?
> 
> happy benchmarking,
> 
> rick jones
> 

No difference, because I already optimized this case last year ;)


commit f07d960df33c5aef8f513efce0fd201f962f94a1
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Mon Nov 28 22:41:47 2011 +0000

    tcp: avoid frag allocation for small frames
    
    tcp_sendmsg() uses select_size() helper to choose skb head size when a
    new skb must be allocated.
    
    If GSO is enabled for the socket, current strategy is to force all
    payload data to be outside of headroom, in PAGE fragments.
    
    This strategy is not welcome for small packets, wasting memory.
    
    Experiments show that best results are obtained when using 2048 bytes
    for skb head (This includes the skb overhead and various headers)
    
    This patch provides better len/truesize ratios for packets sent to
    loopback device, and reduce memory needs for in-flight loopback packets,
    particularly on arches with big pages.
    
    If a sender sends many 1-byte packets to an unresponsive application,
    receiver rmem_alloc will grow faster and will stop queuing these packets
    sooner, or will collapse its receive queue to free excess memory.
    
    netperf -t TCP_RR results are improved by ~4 %, and many workloads are
    improved as well (tbench, mysql...)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-19 15:14         ` Eric Dumazet
  2012-09-19 17:28           ` Rick Jones
@ 2012-09-19 17:56           ` David Miller
  2012-09-19 19:04             ` Alexander Duyck
  2012-09-19 20:18           ` Ben Hutchings
  2012-09-19 22:20           ` Vijay Subramanian
  3 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-19 17:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 19 Sep 2012 17:14:19 +0200

> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
> 
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
> 
> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> thats a 20 % increase.

That's really a lot more than I expected, nice.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-19 17:56           ` David Miller
@ 2012-09-19 19:04             ` Alexander Duyck
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Duyck @ 2012-09-19 19:04 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On 09/19/2012 10:56 AM, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 19 Sep 2012 17:14:19 +0200
>
>> I did some tests and got no problem so far, even using splice() [ this
>> one was tricky because it only deals with order-0 pages at this moment ]
>>
>> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
>>
>> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
>> thats a 20 % increase.
> That's really a lot more than I expected, nice.

When I get some time I will test this patch on a system with an iommu
enabled.  I suspect it will have a huge performance impact there since
now you would be looking at roughly 1/8th the total number of map/unmap
calls on a system with 4K pages.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-19 15:14         ` Eric Dumazet
  2012-09-19 17:28           ` Rick Jones
  2012-09-19 17:56           ` David Miller
@ 2012-09-19 20:18           ` Ben Hutchings
  2012-09-19 22:20           ` Vijay Subramanian
  3 siblings, 0 replies; 37+ messages in thread
From: Ben Hutchings @ 2012-09-19 20:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Wed, 2012-09-19 at 17:14 +0200, Eric Dumazet wrote:
> On Mon, 2012-09-17 at 13:07 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 17 Sep 2012 19:04:53 +0200
> > 
> > > On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:
> > > 
> > >> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
> > >> it could export a dev->max_seg_order (default to 0)
> > > 
> > > Oh well, if we use a per thread order-3 page, a driver wont define an
> > > order, but the max size of a segment (dev->max_seg_size).
> > 
> > Since you said that your audit showed that most can handle arbitrary
> > segment sizes, it's better to default to infinity or similar.
> > 
> > Otherwise we'll have to annotate almost every single driver with a
> > non-zero value, that's not an efficient way to handle this and
> > deploy the higher performance quickly.
> 
> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
> 
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4

I think sfc would also be fine with this; we split at 4K boundaries
regardless of the host page size.

My only concern is fragmentation on busy machines making high-order
allocations more prone to failure (though this change might well slow
that fragmentation).  The larger allocation size should at least be made
dependent on (sk->sk_allocation & GFP_KERNEL) == GFP_KERNEL.  (Even
then, I've seen some stress test failures where ring reallocation
(similar size, GFP_KERNEL) fails.  But those were done with an older
kernel version and the current mm should do better.)

Ben.

> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> thats a 20 % increase.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-19 15:14         ` Eric Dumazet
                             ` (2 preceding siblings ...)
  2012-09-19 20:18           ` Ben Hutchings
@ 2012-09-19 22:20           ` Vijay Subramanian
  2012-09-20  5:37             ` Eric Dumazet
  3 siblings, 1 reply; 37+ messages in thread
From: Vijay Subramanian @ 2012-09-19 22:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
>
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4


I applied this patch to net-next and tested with the e1000e driver.
With iperf I got around an 8% improvement on loopback.

Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>


Vijay

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-19 22:20           ` Vijay Subramanian
@ 2012-09-20  5:37             ` Eric Dumazet
  2012-09-20 17:10               ` Rick Jones
                                 ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-20  5:37 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, netdev

On Wed, 2012-09-19 at 15:20 -0700, Vijay Subramanian wrote:
> > I did some tests and got no problem so far, even using splice() [ this
> > one was tricky because it only deals with order-0 pages at this moment ]
> >
> > NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
> 
> 
> I applied this patch to net-next and tested with e1000e driver.
> With iperf I got around 8 % improvement on loopback.
> 
> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> 
> 
> Vijay

If you keep the producer and consumer on separate cpus, and use large
enough send()s (64KB or 128KB), the gain is more like 15 or 20%.

iperf uses 8KB writes, while netperf uses a 16KB default.

The TCP stack has a problem because the default
/proc/sys/net/ipv4/tcp_reordering value (3) is too small for loopback,
since a packet contains 4 MSS.

A single reorder and some packets are retransmitted.

The following setting is better:

echo 16 >/proc/sys/net/ipv4/tcp_reordering

loopback is lossless, so it's always surprising we can have TCP
retransmits on this medium ;)

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20  5:37             ` Eric Dumazet
@ 2012-09-20 17:10               ` Rick Jones
  2012-09-20 17:43                 ` Eric Dumazet
  2012-09-20 19:40                 ` David Miller
  2012-09-20 21:39               ` Vijay Subramanian
  2012-09-20 22:01               ` Rick Jones
  2 siblings, 2 replies; 37+ messages in thread
From: Rick Jones @ 2012-09-20 17:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, netdev

On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> loopback is lossless, so its always surprising we can have TCP
> retransmits on this medium ;)

Is it lossless?

raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
     19 packets pruned from receive queue because of socket buffer overrun

raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_RR -- -b 256 -D -o burst_size,local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain () port 0 AF_INET : nodelay : histogram : demo : first burst 256
Initial Burst Requests,Local Transport Retransmissions,Remote Transport Retransmissions
256,151,94

raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
     26 packets pruned from receive queue because of socket buffer overrun
raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.38-16-generic #67-Ubuntu SMP Thu Sep 6 17:58:38 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Admittedly, my test is on an older kernel, but have things changed in
this regard since then?  I had to get a bit more contrived on a later
kernel in a VM (vs. what is running directly on my workstation):

raj@tardy-ubuntu-1204:~$ netstat -s | grep -e prune -e retrans
     1 segments retransmited
     4 packets pruned from receive queue because of socket buffer overrun
     1 fast retransmits
raj@tardy-ubuntu-1204:~$ netperf -t TCP_RR -- -b 1024 -D -S 16K -o local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost () port 0 AF_INET : nodelay : demo : first burst 1024
Local Transport Retransmissions,Remote Transport Retransmissions
1,0
raj@tardy-ubuntu-1204:~$ netstat -s | grep -e prune -e retrans
     2 segments retransmited
     7 packets pruned from receive queue because of socket buffer overrun
     2 fast retransmits
raj@tardy-ubuntu-1204:~$ uname -a
Linux tardy-ubuntu-1204 3.6.0-rc3+ #7 SMP Mon Sep 10 14:46:05 PDT 2012 x86_64 x86_64 x86_64 GNU/Linux

rick

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20 17:10               ` Rick Jones
@ 2012-09-20 17:43                 ` Eric Dumazet
  2012-09-20 18:37                   ` Yuchung Cheng
  2012-09-20 19:40                 ` David Miller
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-20 17:43 UTC (permalink / raw)
  To: Rick Jones; +Cc: Vijay Subramanian, David Miller, netdev

On Thu, 2012-09-20 at 10:10 -0700, Rick Jones wrote:
> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> > loopback is lossless, so its always surprising we can have TCP
> > retransmits on this medium ;)
> 
> Is it lossless?
> 

It is lossless, yes.

But packets can be dropped by TCP stack for various reasons, including
reordering and retransmits.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20 17:43                 ` Eric Dumazet
@ 2012-09-20 18:37                   ` Yuchung Cheng
  0 siblings, 0 replies; 37+ messages in thread
From: Yuchung Cheng @ 2012-09-20 18:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, Vijay Subramanian, David Miller, netdev

On Thu, Sep 20, 2012 at 10:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-20 at 10:10 -0700, Rick Jones wrote:
>> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>> > loopback is lossless, so its always surprising we can have TCP
>> > retransmits on this medium ;)
>>
>> Is it lossless?
>>
>
> It is lossless, yes.
>
> But packets can be dropped by TCP stack for various reasons, including
> reordering and retransmits.
I'd recommend checking the reordering stats. If it's lossless, set
tp->reordering = 127 for loopback.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20 17:10               ` Rick Jones
  2012-09-20 17:43                 ` Eric Dumazet
@ 2012-09-20 19:40                 ` David Miller
  2012-09-20 20:06                   ` Rick Jones
  1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-20 19:40 UTC (permalink / raw)
  To: rick.jones2; +Cc: eric.dumazet, subramanian.vijay, netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 20 Sep 2012 10:10:43 -0700

> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>> loopback is lossless, so its always surprising we can have TCP
>> retransmits on this medium ;)
> 
> Is it lossless?
> 
> raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
>     19 packets pruned from receive queue because of socket buffer overrun

Those packets are not being dropped by the loopback device.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20 19:40                 ` David Miller
@ 2012-09-20 20:06                   ` Rick Jones
  2012-09-20 20:25                     ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Rick Jones @ 2012-09-20 20:06 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, subramanian.vijay, netdev

On 09/20/2012 12:40 PM, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 20 Sep 2012 10:10:43 -0700
>
>> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>>> loopback is lossless, so its always surprising we can have TCP
>>> retransmits on this medium ;)
>>
>> Is it lossless?
>>
>> raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
>>      19 packets pruned from receive queue because of socket buffer overrun
>
> Those packets are not being dropped by the loopback device.
>

Yes, I was being too fast and loose with my wording, paying more 
attention to the netperf tests than the rest of it.  While loopback may 
be lossless, TCP retransmissions over loopback shouldn't be all *that* 
surprising.

rick

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20 20:06                   ` Rick Jones
@ 2012-09-20 20:25                     ` Eric Dumazet
  2012-09-21 15:48                       ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-20 20:25 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Miller, subramanian.vijay, netdev

On Thu, 2012-09-20 at 13:06 -0700, Rick Jones wrote:

> 
> Yes, I was being too fast and loose with my wording, paying more 
> attention to the netperf tests than the rest of it.  While loopback may 
> be lossless, TCP retransmissions over loopback shouldn't be all *that* 
> surprising.

Sending perfect packets (large packets) should trigger no retransmits.

In your tests, you send one-byte packets, so obviously the receiver will
drop some of them, because the sk_rcvbuf limit (or the backlog limit) is
hit very fast.

(This should be less frequent with the TCP coalescing that was recently
introduced: we are able to coalesce about 1600 one-byte packets into a
single one.)

netperf -t TCP_STREAM over loopback should not drop packets or
retransmit them.

# netstat -s|grep TCPRcvCoalesce
    TCPRcvCoalesce: 0

# netperf -t TCP_RR -- -b 1024 -D -S 16K -o local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 () port 0 AF_INET to localhost () port 0 AF_INET : nodelay : first burst 1024
Local Transport Retransmissions,Remote Transport Retransmissions
0,0

# netstat -s|grep TCPRcvCoalesce
    TCPRcvCoalesce: 2072191

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20  5:37             ` Eric Dumazet
  2012-09-20 17:10               ` Rick Jones
@ 2012-09-20 21:39               ` Vijay Subramanian
  2012-09-20 22:01               ` Rick Jones
  2 siblings, 0 replies; 37+ messages in thread
From: Vijay Subramanian @ 2012-09-20 21:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

>>
>> I applied this patch to net-next and tested with e1000e driver.
>> With iperf I got around 8 % improvement on loopback.
>>
>> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>

I think this tag should be on the thread with the actual patch. I will
reply to your patch with the Tested-by tag.

Thanks for your tips, Eric. For what it's worth, here is what I found.

>
> If you keep the producer and consumer on separate cpus, and use large
> enough send() (64KB or 128KB), gain is more like 15 or 20%
>

Curiously, when I use taskset to run the iperf server and client on
different cpus, throughput goes down by half for both the baseline
(master branch) and the patch. Is taskset the right way to test this?

I did notice a change in absolute throughput when I increased the
send() buffer size. However, both the baseline and the patch showed an
improvement, and the relative improvement was still around 8%.


> iperf uses 8KB writes, while netperf uses a 16KB default.

I think iperf has a bug: both the man page and comments in the code
claim the default buffer size for read/write is 8KB, but the actual
default seems to be 128KB (the -l option with iperf).


Thanks !
Vijay

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20  5:37             ` Eric Dumazet
  2012-09-20 17:10               ` Rick Jones
  2012-09-20 21:39               ` Vijay Subramanian
@ 2012-09-20 22:01               ` Rick Jones
  2 siblings, 0 replies; 37+ messages in thread
From: Rick Jones @ 2012-09-20 22:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, netdev

On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> iperf uses 8KB writes, while netperf uses a 16KB default.

For the sake of the archives and posterity, netperf does not have a 
"fixed" default send size.  The "default" will vary with the platform 
and platform tuning.

What netperf does (for TCP at least) is default the send size to the 
value returned after a getsockopt(SO_SNDBUF) issued against the socket 
just after it is allocated for the data connection.  If the user has 
asked for a specific socket buffer size, there will have been a 
preceding setsockopt(SO_SNDBUF) call.

So, "by default" under Linux, with no options to set the socket buffer 
size, netperf will use 16 KB so long as that is the default (initial) 
value for SO_SNDBUF.

The sequence will go something like:

1) create the data socket
2) if user asked to set socket buffer size call setsockopt()
3) call getsockopt()
4) if the user did not specify a send size, use the value returned from 
the getsockopt() call

So, if one runs netperf on a platform other than Linux, the "default" 
send size may be different.  Similarly, if running under Linux but 
net.ipv4.tcp_wmem is tweaked, the "default" send size may be different.
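
In other words, the selection boils down to the pattern below. This is
not netperf's actual source, just a compressed sketch of the
setsockopt()/getsockopt() dance described above (function and parameter
names are invented):

#include <sys/socket.h>

/* Sketch: pick the send size for the data socket the way the text above
 * describes. requested_sndbuf/requested_send_size are <= 0 when the user
 * gave no explicit socket-buffer or send-size option.
 */
static int pick_send_size(int data_sock, int requested_sndbuf,
			  int requested_send_size)
{
	int sndbuf;
	socklen_t len = sizeof(sndbuf);

	if (requested_sndbuf > 0)	/* explicit socket buffer request */
		setsockopt(data_sock, SOL_SOCKET, SO_SNDBUF,
			   &requested_sndbuf, sizeof(requested_sndbuf));

	getsockopt(data_sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);

	/* No explicit send size: default to whatever SO_SNDBUF reports. */
	return requested_send_size > 0 ? requested_send_size : sndbuf;
}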

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-20 20:25                     ` Eric Dumazet
@ 2012-09-21 15:48                       ` Eric Dumazet
  2012-09-21 16:27                         ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-21 15:48 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Miller, subramanian.vijay, netdev

On Thu, 2012-09-20 at 22:25 +0200, Eric Dumazet wrote:
> On Thu, 2012-09-20 at 13:06 -0700, Rick Jones wrote:
> 
> > 
> > Yes, I was being too fast and loose with my wording, paying more 
> > attention to the netperf tests than the rest of it.  While loopback may 
> > be lossless, TCP retransmissions over loopback shouldn't be all *that* 
> > surprising.
> 
> Sending perfect packets (large packets) should trigger no retransmits.

By the way, with the current MTU of 16436 on loopback, the max packet
size is 48KB (3 MSS).

Using an MTU of 65536 allows another 25% increase in bulk performance...

(and fewer potential reordering effects, as a packet then contains one
MSS instead of three)

There is probably a reason why lo default MTU is 16436 ?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-21 15:48                       ` Eric Dumazet
@ 2012-09-21 16:27                         ` David Miller
  2012-09-21 16:51                           ` Eric Dumazet
  2012-09-23 12:47                           ` Jan Engelhardt
  0 siblings, 2 replies; 37+ messages in thread
From: David Miller @ 2012-09-21 16:27 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rick.jones2, subramanian.vijay, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Sep 2012 17:48:31 +0200

> There is probably a reason why lo default MTU is 16436 ?

That's what fit into L1 caches back in 1999

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-21 16:27                         ` David Miller
@ 2012-09-21 16:51                           ` Eric Dumazet
  2012-09-21 17:04                             ` David Miller
  2012-09-23 12:47                           ` Jan Engelhardt
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-21 16:51 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, subramanian.vijay, netdev

On Fri, 2012-09-21 at 12:27 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 21 Sep 2012 17:48:31 +0200
> 
> > There is probably a reason why lo default MTU is 16436 ?
> 
> That's what fit into L1 caches back in 1999

I see ;)

Nowadays, we even have the NETIF_F_NOCACHE_COPY flag and
__copy_from_user_nocache()

Hmm, we can not toggle this flag on loopback yet

# ethtool -K lo tx-nocache-copy on
Could not change any device features

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-21 16:51                           ` Eric Dumazet
@ 2012-09-21 17:04                             ` David Miller
  2012-09-21 17:11                               ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-21 17:04 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rick.jones2, subramanian.vijay, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Sep 2012 18:51:21 +0200

> On Fri, 2012-09-21 at 12:27 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Fri, 21 Sep 2012 17:48:31 +0200
>> 
>> > There is probably a reason why lo default MTU is 16436 ?
>> 
>> That's what fit into L1 caches back in 1999
> 
> I see ;)
> 
> Nowadays, we even have the NETIF_F_NOCACHE_COPY flag and
> __copy_from_user_nocache()
> 
> Hmm, we can not toggle this flag on loopback yet
> 
> # ethtool -K lo tx-nocache-copy on
> Could not change any device features

It's a silly limitation, in net/core/dev.c:

	/* Turn on no cache copy if HW is doing checksum */
	if (!(dev->flags & IFF_LOOPBACK)) {
		dev->hw_features |= NETIF_F_NOCACHE_COPY;
		if (dev->features & NETIF_F_ALL_CSUM) {
			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
			dev->features |= NETIF_F_NOCACHE_COPY;
		}
	}

Maybe this is better done as:

	/* Turn on no cache copy if HW is doing checksum */
	dev->hw_features |= NETIF_F_NOCACHE_COPY;
	if (!(dev->flags & IFF_LOOPBACK)) {
		if (dev->features & NETIF_F_ALL_CSUM) {
			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
			dev->features |= NETIF_F_NOCACHE_COPY;
		}
	}

And then the code matches more closely the comment. :-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-21 17:04                             ` David Miller
@ 2012-09-21 17:11                               ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-21 17:11 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, subramanian.vijay, netdev

On Fri, 2012-09-21 at 13:04 -0400, David Miller wrote:

> It's a silly limitation, in net/core/dev.c:
> 
> 	/* Turn on no cache copy if HW is doing checksum */
> 	if (!(dev->flags & IFF_LOOPBACK)) {
> 		dev->hw_features |= NETIF_F_NOCACHE_COPY;
> 		if (dev->features & NETIF_F_ALL_CSUM) {
> 			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
> 			dev->features |= NETIF_F_NOCACHE_COPY;
> 		}
> 	}
> 
> Maybe this is probably better done as:
> 
> 	/* Turn on no cache copy if HW is doing checksum */
> 	dev->hw_features |= NETIF_F_NOCACHE_COPY;
> 	if (!(dev->flags & IFF_LOOPBACK)) {
> 		if (dev->features & NETIF_F_ALL_CSUM) {
> 			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
> 			dev->features |= NETIF_F_NOCACHE_COPY;
> 		}
> 	}
> 
> And then the code matches more closely the comment. :-)

I did a test with various combinations (producer/consumer on the same
core or not, same cpu or not...), and performance is divided by 2.

So I guess we can leave the code as is.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-21 16:27                         ` David Miller
  2012-09-21 16:51                           ` Eric Dumazet
@ 2012-09-23 12:47                           ` Jan Engelhardt
  2012-09-23 16:16                             ` David Miller
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Engelhardt @ 2012-09-23 12:47 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev

On Friday 2012-09-21 18:27, David Miller wrote:

>From: Eric Dumazet <eric.dumazet@gmail.com>
>Date: Fri, 21 Sep 2012 17:48:31 +0200
>
>> There is probably a reason why lo default MTU is 16436 ?
>
>That's what fit into L1 caches back in 1999

Would it make sense to automatically set lo's MTU on device registration 
to the actual size of the L1 that the $running system has?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-23 12:47                           ` Jan Engelhardt
@ 2012-09-23 16:16                             ` David Miller
  2012-09-23 17:40                               ` Jan Engelhardt
  0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-23 16:16 UTC (permalink / raw)
  To: jengelh; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev

From: Jan Engelhardt <jengelh@inai.de>
Date: Sun, 23 Sep 2012 14:47:28 +0200 (CEST)

> On Friday 2012-09-21 18:27, David Miller wrote:
> 
>>From: Eric Dumazet <eric.dumazet@gmail.com>
>>Date: Fri, 21 Sep 2012 17:48:31 +0200
>>
>>> There is probably a reason why lo default MTU is 16436 ?
>>
>>That's what fit into L1 caches back in 1999
> 
> Would it make sense to automatically set lo's MTU on device
> registration to the actual size of the L1 that the $running system
> has?

I think a fixed value of 64K would be the easiest, because another issue
is that back in 1999 we didn't have GRO/GSO/etc.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-23 16:16                             ` David Miller
@ 2012-09-23 17:40                               ` Jan Engelhardt
  2012-09-23 18:13                                 ` Eric Dumazet
  2012-09-23 18:27                                 ` David Miller
  0 siblings, 2 replies; 37+ messages in thread
From: Jan Engelhardt @ 2012-09-23 17:40 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev


On Sunday 2012-09-23 18:16, David Miller wrote:
>> On Friday 2012-09-21 18:27, David Miller wrote:
>> 
>>>From: Eric Dumazet <eric.dumazet@gmail.com>
>>>Date: Fri, 21 Sep 2012 17:48:31 +0200
>>>
>>>> There is probably a reason why lo default MTU is 16436 ?
>>>
>>>That's what fit into L1 caches back in 1999
>> 
>> Would it make sense to automatically set lo's MTU on device
>> registration to the actual size of the L1 that the $running system
>> has?
>
>I think a fixed value of 64K would be a easiest, because another issue
>is that back in 1999 we didn't have GRO/GSO/etc.


Cache sizes, and an oddity.

L1 cache sizes have not increased since then (the 2011 Intel i7-2600
has the same amount of L1 as a 1998-ish AMD K6-2), and the Atom N450
even has less, namely 24K data + 32K instruction, meaning a 13000-ish
MTU might be more accurate for netbooks of this kind.

What would an MTU of 64K buy? Offloading seems pointless for the lo
device. There is no hardware to offload it to - well, the
machine already has the packet anyway.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-23 17:40                               ` Jan Engelhardt
@ 2012-09-23 18:13                                 ` Eric Dumazet
  2012-09-23 18:27                                 ` David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-23 18:13 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: David Miller, rick.jones2, subramanian.vijay, netdev

On Sun, 2012-09-23 at 19:40 +0200, Jan Engelhardt wrote:

> 
> Cache sizes, and an oddity.
> 
> L1 cache sizes have not increased ever since (the 2011 Intel i7-2600
> has the same amount of L1 as a 1998ish AMD K6-2), and the Atom N450
> even has less, namely 24d+32i, meaning a 13000ish MTU might be more
> accurate for netbooks of this kind.
> 

An MTU above 64K is not really useful, at least for IPv4, I think.

> What would a MTU of 64K buy? Offloading seems pointless for the lo
> device. There is no hardware to offload it to - well, the
> machine already has the packet too.

It buys a 25% increase in performance. Not too bad...

I have yet to make sure no adjustment is needed, in the TCP stack for
example.

If you try to change the lo MTU on an old kernel (e.g. 2.6.38, on my
laptop), you can notice some packet drops (TCPBacklogDrop).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-23 17:40                               ` Jan Engelhardt
  2012-09-23 18:13                                 ` Eric Dumazet
@ 2012-09-23 18:27                                 ` David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: David Miller @ 2012-09-23 18:27 UTC (permalink / raw)
  To: jengelh; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev

From: Jan Engelhardt <jengelh@inai.de>
Date: Sun, 23 Sep 2012 19:40:33 +0200 (CEST)

> What would a MTU of 64K buy? Offloading seems pointless for the lo
> device. There is no hardware to offload it to - well, the
> machine already has the packet too.

Traversing the stack once instead of N times.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-09-17  7:49 [RFC] tcp: use order-3 pages in tcp_sendmsg() Eric Dumazet
  2012-09-17 16:12 ` David Miller
@ 2012-11-15  7:52 ` Yan, Zheng 
  2012-11-15 13:06   ` Eric Dumazet
                     ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Yan, Zheng  @ 2012-11-15  7:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> We currently use per socket page reserve for tcp_sendmsg() operations.
>
> Its done to raise the probability of coalescing small write() in to
> single segments in the skbs.
>
> But it wastes a lot of memory for applications handling a lot of mostly
> idle sockets, since each socket holds one page in sk->sk_sndmsg_page
>
> I did a small experiment to use order-3 pages and it gave me a 10% boost
> of performance, because each TSO skb can use only two frags of 32KB,
> instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit() to
> setup the tx descriptor and TX completion path to unmap the frags and
> free them.
>
> We also spend less time in tcp_sendmsg(), because we call page allocator
> 8x less often.
>
> Now back to the per socket page, what about trying to factorize it ?
>
> Since we can sleep (or/and do a cpu migration) in tcp_sendmsg(), we cant
> really use a percpu page reserve as we do in __netdev_alloc_frag()
>
> We could instead use a per thread reserve, at the cost of adding a test
> in task exit handler.
>
> Recap :
>
> 1) Use a per thread page reserve instead of a per socket one
> 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
>
>

Hi,

This commit makes one of our test cases on a Core 2 machine drop in
performance by about 60%. The test case runs 2048 instances of the
netperf 64k stream test at the same time.  Analysis showed that using
order-3 pages causes more LLC misses; most of the new LLC misses happen
when the senders copy data to the socket buffer.  If we revert to using
a single page, the sender side only triggers a few LLC misses, and most
LLC misses happen on the receiver side.  It means most pages allocated
by the senders are cache hot.  But when using order-3 pages, 2048 * 32k
= 64M, which is much larger than the LLC size.  Should we be worried
about this regression, or is our test case too unrealistic?

Regards
Yan, Zheng

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-11-15  7:52 ` Yan, Zheng 
@ 2012-11-15 13:06   ` Eric Dumazet
  2012-11-16  2:36     ` Yan, Zheng 
  2012-11-15 13:47   ` Eric Dumazet
  2012-11-15 18:33   ` Rick Jones
  2 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-11-15 13:06 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: netdev

On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
> On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > We currently use per socket page reserve for tcp_sendmsg() operations.
> >
> > Its done to raise the probability of coalescing small write() in to
> > single segments in the skbs.
> >
> > But it wastes a lot of memory for applications handling a lot of mostly
> > idle sockets, since each socket holds one page in sk->sk_sndmsg_page
> >
> > I did a small experiment to use order-3 pages and it gave me a 10% boost
> > of performance, because each TSO skb can use only two frags of 32KB,
> > instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit() to
> > setup the tx descriptor and TX completion path to unmap the frags and
> > free them.
> >
> > We also spend less time in tcp_sendmsg(), because we call page allocator
> > 8x less often.
> >
> > Now back to the per socket page, what about trying to factorize it ?
> >
> > Since we can sleep (or/and do a cpu migration) in tcp_sendmsg(), we cant
> > really use a percpu page reserve as we do in __netdev_alloc_frag()
> >
> > We could instead use a per thread reserve, at the cost of adding a test
> > in task exit handler.
> >
> > Recap :
> >
> > 1) Use a per thread page reserve instead of a per socket one
> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
> >
> >
> 
> Hi,
> 
> This commit makes one of our test case on core 2 machine drop in performance
> by about 60%. The test case runs 2048 instances of netperf 64k stream test at
> the same time.  Analysis showed using order-3 pages causes more LLC misses,
> most new LLC misses happen when the senders copy data to the socket buffer.
> If revert to use single page, the sender side only trigger a few LLC
> misses, most
> LLC misses happen on the receiver size. It means most pages allocated by the
> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
> is much larger than LLC size. Should this regression be worried? or
> our test case
> is too unpractical?

Hi Yan

You forgot to give some basic information with this mail, like the
hardware configuration, NIC driver, ...

Increasing performance can sometimes change the balance you had on a
prior workload.

The number of in-flight bytes does not depend on the order of the pages,
but on the sizes of the TCP buffers (receiver, sender).

TCP Small Queues was an attempt to reduce the number of in-flight bytes;
you should try to change either the SO_SNDBUF or SO_RCVBUF settings
(instead of letting the system autotune them) if you really need 2048
concurrent flows.

Otherwise, each flow can consume up to 6 MB of memory, so obviously your
cpu caches won't hold 2048 * 6MB of memory...

If the sender is faster (because of this commit) but the receiver is
slow to drain the receive queues, then you can have a situation where
the consumed memory on the receiver is higher and the receiver might
actually be slower.
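
For reference, capping both buffers instead of letting autotuning grow
them is just a pair of setsockopt() calls made on each test socket
before data starts flowing; a minimal sketch, with an arbitrary 128KB
value chosen purely for illustration:

#include <sys/socket.h>

/* Pin SO_SNDBUF/SO_RCVBUF so 2048 flows cannot each autotune toward
 * several MB.  Call before connect()/listen(); note the kernel doubles
 * the requested value internally for bookkeeping overhead.
 */
static void cap_socket_buffers(int fd)
{
	int bufsz = 128 * 1024;	/* illustrative value only */

	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));
	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));
}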

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-11-15  7:52 ` Yan, Zheng 
  2012-11-15 13:06   ` Eric Dumazet
@ 2012-11-15 13:47   ` Eric Dumazet
  2012-11-21  8:05     ` Yan, Zheng
  2012-11-15 18:33   ` Rick Jones
  2 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-11-15 13:47 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: netdev

On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
> LLC misses happen on the receiver size. It means most pages allocated by the
> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
> is much larger than LLC size.

By the way, this 2048*32k is wrong, as the receiver only uses fragments
in pages, and it's not related to the "tcp: use order-3 pages in
tcp_sendmsg()" commit.

Many drivers use order-0 pages to hold ethernet frames, regardless of
what was used by the sender ;)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-11-15  7:52 ` Yan, Zheng 
  2012-11-15 13:06   ` Eric Dumazet
  2012-11-15 13:47   ` Eric Dumazet
@ 2012-11-15 18:33   ` Rick Jones
  2 siblings, 0 replies; 37+ messages in thread
From: Rick Jones @ 2012-11-15 18:33 UTC (permalink / raw)
  To: Yan, Zheng ; +Cc: Eric Dumazet, netdev

On 11/14/2012 11:52 PM, Yan, Zheng wrote:
> This commit makes one of our test case on core 2 machine drop in
> performance by about 60%. The test case runs 2048 instances of
> netperf 64k stream test at the same time.

I'm impressed that 2048 concurrent netperf TCP_STREAM tests ran to 
completion in the first place :)

> Analysis showed using order-3 pages causes more LLC misses, most new
> LLC misses happen when the senders copy data to the socket buffer.

> If revert to use single page, the sender side only trigger a few LLC
>  misses, most LLC misses happen on the receiver size. It means most
> pages allocated by the senders are cache hot. But when using order-3
> pages, 2048 * 32k = 64M, 64M is much larger than LLC size. Should
> this regression be worried? or our test case is too unpractical?

Even before the page change I would have expected the buffers that 
netperf itself uses to have exceeded the LLC.  If you were not using 
test-specific -s and -S options to set an explicit socket buffer size, I 
believe that under Linux (most of the time) the default SO_SNDBUF size 
will be 86KB.  Coupled with your statement that the send size was 64K, 
it means the send ring being used by netperf will be two 64KB buffers, 
which would then be 256MB across 2048 concurrent netperfs.  Even if we 
go with "only the one send buffer in play at a time matters", that is 
still 128 MB of space tied up in netperf itself even before one gets to 
the stack.

Still, sharing the analysis tool output might be helpful.

By the way the "default" size of the buffer netperf posts in recv() 
calls will depend on the initial value of SO_RCVBUF after the data 
socket is created and had any -s or -S option values applied to it.

I cannot say that the scripts distributed with netperf are consistently 
good about doing it themselves, but I would suggest for the "canonical" 
bulk stream test something like:

netperf -t TCP_STREAM -H <dest> -l 60 -- -s 1M -S 1M -m 64K -M 64K

as that will reduce the number of variables.  Those -s and -S values 
though will probably call for tweaking sysctl settings or they will be 
clipped by net.core.rmem_max and net.core.wmem_max.  At a minimum I 
would suggest having the -m and -M option.  I might also tack-on a "-o 
all" at the end, but that is a matter of preference - it will cause a 
great deal of output...

Eric Dumazet later says:
> Number of in flight bytes do not depend on the order of the pages, but
> sizes of TCP buffers (receiver, sender)

And unless you happened to use explicit -s and -S options, there is even 
more variability in how much may be inflight.  If you do not add those 
you can at least get netperf to report what the socket buffer sizes 
became by the end of the test:

netperf -t TCP_STREAM ... -- ... -o lss_size_end,rsr_size_end

for "local socket send size" and "remote socket receive size" respectively.

> If the sender is faster (because of this commit), but receiver is slow
> to drain the receive queues, then you can have a situation where the
> consumed memory on receiver is higher and the receiver might be actually
> slower.

Netperf can be told to report the number of receive calls and the bytes 
per receive - either by tacking-on a global "-v 2" or by requesting them 
explicitly via omni output selection.  Presumably, if the receiving 
netserver processes are not keeping-up as well, that should manifest as 
the bytes per receive being larger in the "after" case than the "before" 
case.

netperf ... -- ... -o remote_recv_size,remote_recv_calls,remote_bytes_per_recv

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-11-15 13:06   ` Eric Dumazet
@ 2012-11-16  2:36     ` Yan, Zheng 
  0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng  @ 2012-11-16  2:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, rick.jones2, Yan, Zheng

On Thu, Nov 15, 2012 at 9:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
>> On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > We currently use per socket page reserve for tcp_sendmsg() operations.
>> >
>> > Its done to raise the probability of coalescing small write() in to
>> > single segments in the skbs.
>> >
>> > But it wastes a lot of memory for applications handling a lot of mostly
>> > idle sockets, since each socket holds one page in sk->sk_sndmsg_page
>> >
>> > I did a small experiment to use order-3 pages and it gave me a 10% boost
>> > of performance, because each TSO skb can use only two frags of 32KB,
>> > instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit() to
>> > setup the tx descriptor and TX completion path to unmap the frags and
>> > free them.
>> >
>> > We also spend less time in tcp_sendmsg(), because we call page allocator
>> > 8x less often.
>> >
>> > Now back to the per socket page, what about trying to factorize it ?
>> >
>> > Since we can sleep (or/and do a cpu migration) in tcp_sendmsg(), we cant
>> > really use a percpu page reserve as we do in __netdev_alloc_frag()
>> >
>> > We could instead use a per thread reserve, at the cost of adding a test
>> > in task exit handler.
>> >
>> > Recap :
>> >
>> > 1) Use a per thread page reserve instead of a per socket one
>> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
>> >
>> >
>>
>> Hi,
>>
>> This commit makes one of our test case on core 2 machine drop in performance
>> by about 60%. The test case runs 2048 instances of netperf 64k stream test at
>> the same time.  Analysis showed using order-3 pages causes more LLC misses,
>> most new LLC misses happen when the senders copy data to the socket buffer.
>> If revert to use single page, the sender side only trigger a few LLC
>> misses, most
>> LLC misses happen on the receiver size. It means most pages allocated by the
>> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
>> is much larger than LLC size. Should this regression be worried? or
>> our test case
>> is too unpractical?
>
> Hi Yan
>
> You forgot to give some basic information with this mail, like the
> hardware configuration, NIC driver, ...
>
> Increasing performance can sometime change the balance you had on a
> prior workload.
>
> Number of in flight bytes do not depend on the order of the pages, but
> sizes of TCP buffers (receiver, sender)
>
> TCP Small queue was an attempt to reduce the number of in-flight bytes,
> you should try to change either SO_SNDBUF or SO_RCVBUF settings (instead
> of letting the system autotune them) if you really need 2048 concurrent
> flows.
>
> Otherwise, each flow can consume up to 6 MB of memory, so obviously your
> cpu caches wont hold 2048*6MB of memory...
>
> If the sender is faster (because of this commit), but receiver is slow
> to drain the receive queues, then you can have a situation where the
> consumed memory on receiver is higher and the receiver might be actually
> slower.
>

I'm sorry, I forgot to mention that the test ran on the loopback device.
It's one test case in our kernel performance test project. This test
case is very sensitive to memory allocation and scheduler behavior
changes.

Regards
Yan, Zheng

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
  2012-11-15 13:47   ` Eric Dumazet
@ 2012-11-21  8:05     ` Yan, Zheng
  0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2012-11-21  8:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, Nov 15, 2012 at 9:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
>> LLC misses happen on the receiver size. It means most pages allocated by the
>> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
>> is much larger than LLC size.
>
> By the way, this 2048*32k is wrong, as the receiver only uses fragments
> in pages, and its not related to the "tcp: use order-3 pages in
> tcp_sendmsg()" commit
>
> Many drivers use order-0 pages to hold ethernet frames, regardless of
> what was used by the sender ;)
>
>
Hi,

I think we found the root cause of this regression. The test case runs
2048 instances of the netperf TCP loopback stream test on a two-socket
Core 2 machine. There are more LLC misses when using order-3 pages.
Core 2 is not a NUMA architecture; there is only one memory node.
Order-3 pages used by one socket may later be re-used by another socket,
which causes lots of LLC invalidation.  Using order-0 pages doesn't have
this issue because the kernel page allocator uses per-cpu lists to
optimize order-0 page allocation.
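
To make the asymmetry concrete, here is a rough sketch (not the actual
mm code, and the two helpers are invented names) of how the 3.x page
allocator dispatches the two orders: order-0 requests come from a
per-cpu free list, so a freed page tends to be handed back to the same,
still cache-hot CPU, while higher orders go straight to the shared buddy
free lists under the zone lock:

/* Sketch only: order-0 uses the per-cpu pageset, higher orders do not. */
static struct page *sketch_rmqueue(struct zone *zone, unsigned int order,
				   int migratetype)
{
	if (order == 0)
		return take_from_this_cpu_pcp_list(zone, migratetype);

	return take_from_buddy_freelists(zone, order, migratetype);
}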

Regards
Yan, Zheng

^ permalink raw reply	[flat|nested] 37+ messages in thread

Thread overview: 37+ messages
2012-09-17  7:49 [RFC] tcp: use order-3 pages in tcp_sendmsg() Eric Dumazet
2012-09-17 16:12 ` David Miller
2012-09-17 17:02   ` Eric Dumazet
2012-09-17 17:04     ` Eric Dumazet
2012-09-17 17:07       ` David Miller
2012-09-19 15:14         ` Eric Dumazet
2012-09-19 17:28           ` Rick Jones
2012-09-19 17:55             ` Eric Dumazet
2012-09-19 17:56           ` David Miller
2012-09-19 19:04             ` Alexander Duyck
2012-09-19 20:18           ` Ben Hutchings
2012-09-19 22:20           ` Vijay Subramanian
2012-09-20  5:37             ` Eric Dumazet
2012-09-20 17:10               ` Rick Jones
2012-09-20 17:43                 ` Eric Dumazet
2012-09-20 18:37                   ` Yuchung Cheng
2012-09-20 19:40                 ` David Miller
2012-09-20 20:06                   ` Rick Jones
2012-09-20 20:25                     ` Eric Dumazet
2012-09-21 15:48                       ` Eric Dumazet
2012-09-21 16:27                         ` David Miller
2012-09-21 16:51                           ` Eric Dumazet
2012-09-21 17:04                             ` David Miller
2012-09-21 17:11                               ` Eric Dumazet
2012-09-23 12:47                           ` Jan Engelhardt
2012-09-23 16:16                             ` David Miller
2012-09-23 17:40                               ` Jan Engelhardt
2012-09-23 18:13                                 ` Eric Dumazet
2012-09-23 18:27                                 ` David Miller
2012-09-20 21:39               ` Vijay Subramanian
2012-09-20 22:01               ` Rick Jones
2012-11-15  7:52 ` Yan, Zheng 
2012-11-15 13:06   ` Eric Dumazet
2012-11-16  2:36     ` Yan, Zheng 
2012-11-15 13:47   ` Eric Dumazet
2012-11-21  8:05     ` Yan, Zheng
2012-11-15 18:33   ` Rick Jones
