* [RFC] tcp: use order-3 pages in tcp_sendmsg()
@ 2012-09-17 7:49 Eric Dumazet
2012-09-17 16:12 ` David Miller
2012-11-15 7:52 ` Yan, Zheng
0 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-17 7:49 UTC (permalink / raw)
To: netdev
We currently use a per-socket page reserve for tcp_sendmsg() operations.
It's done to raise the probability of coalescing small write()s into
single segments in the skbs.
But it wastes a lot of memory for applications handling a lot of mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
I did a small experiment using order-3 pages and it gave me a 10%
performance boost, because each TSO skb can use only two frags of 32KB
instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit()
setting up the tx descriptors, and in the TX completion path unmapping
and freeing the frags.
We also spend less time in tcp_sendmsg(), because we call the page
allocator 8x less often.
Now back to the per-socket page: what about trying to factor it out?
Since we can sleep (and/or migrate between cpus) in tcp_sendmsg(), we
can't really use a percpu page reserve as we do in __netdev_alloc_frag().
We could instead use a per-thread reserve, at the cost of adding a test
in the task exit handler.
Recap :
1) Use a per thread page reserve instead of a per socket one
2) Use order-3 pages (or order-0 pages if page size is >= 32768)
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-17 7:49 [RFC] tcp: use order-3 pages in tcp_sendmsg() Eric Dumazet
@ 2012-09-17 16:12 ` David Miller
2012-09-17 17:02 ` Eric Dumazet
2012-11-15 7:52 ` Yan, Zheng
1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-17 16:12 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 17 Sep 2012 09:49:04 +0200
> 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
We could do with an audit to make sure drivers (and the stack in
general) can handle SKB frags of length > PAGE_SIZE.
I have no idea whether such problems actually exist, but
I can say it's a case that doesn't get much testing.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-17 16:12 ` David Miller
@ 2012-09-17 17:02 ` Eric Dumazet
2012-09-17 17:04 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-17 17:02 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, 2012-09-17 at 12:12 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 17 Sep 2012 09:49:04 +0200
>
> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
>
> We could do with an audit to make sure drivers (and the stack in
> general) can handle SKB frags of length > PAGE_SIZE.
>
> I have no idea whether such problems actually exist, but
> I can say it's a case that doesn't get much testing.
I did a (quick) audit and it appears some NICs have limits like 16KB,
but they have helpers to support this, since some arches have
PAGE_SIZE=65536.
ixgbe is an example, although it might need some tweaking if this code
path was not tested.
On the other hand, bnx2x has special code to linearize overly
fragmented skbs (in bnx2x_pkt_req_lin(), if skb_shinfo(skb)->nr_frags >=
10).
By the way, I did more performance tests, and the speedup is closer
to 20%.
A driver already exports dev->gso_max_size and dev->gso_max_segs; I
guess it could export a dev->max_seg_order (default to 0).
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-17 17:02 ` Eric Dumazet
@ 2012-09-17 17:04 ` Eric Dumazet
2012-09-17 17:07 ` David Miller
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-17 17:04 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:
> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
> it could export a dev->max_seg_order (default to 0)
Oh well, if we use a per-thread order-3 page, a driver won't define an
order, but the max size of a segment (dev->max_seg_size).
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-17 17:04 ` Eric Dumazet
@ 2012-09-17 17:07 ` David Miller
2012-09-19 15:14 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-17 17:07 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 17 Sep 2012 19:04:53 +0200
> On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:
>
>> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
>> it could export a dev->max_seg_order (default to 0)
>
> Oh well, if we use a per thread order-3 page, a driver wont define an
> order, but the max size of a segment (dev->max_seg_size).
Since you said that your audit showed that most can handle arbitrary
segment sizes, it's better to default to infinity or similar.
Otherwise we'll have to annotate almost every single driver with a
non-zero value; that's not an efficient way to handle this and deploy
the higher performance quickly.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-17 17:07 ` David Miller
@ 2012-09-19 15:14 ` Eric Dumazet
2012-09-19 17:28 ` Rick Jones
` (3 more replies)
0 siblings, 4 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-19 15:14 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, 2012-09-17 at 13:07 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 17 Sep 2012 19:04:53 +0200
>
> > On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:
> >
> >> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
> >> it could export a dev->max_seg_order (default to 0)
> >
> > Oh well, if we use a per thread order-3 page, a driver wont define an
> > order, but the max size of a segment (dev->max_seg_size).
>
> Since you said that your audit showed that most can handle arbitrary
> segment sizes, it's better to default to infinity or similar.
>
> Otherwise we'll have to annotate almost every single driver with a
> non-zero value, that's not an efficient way to handle this and
> deploy the higher performance quickly.
I did some tests and hit no problems so far, even using splice() [ this
one was tricky because it only deals with order-0 pages at the moment ]
NICs tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
On loopback, netperf performance goes from 31900 Mb/s to 38500 Mb/s,
that's a 20% increase.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-19 15:14 ` Eric Dumazet
@ 2012-09-19 17:28 ` Rick Jones
2012-09-19 17:55 ` Eric Dumazet
2012-09-19 17:56 ` David Miller
` (2 subsequent siblings)
3 siblings, 1 reply; 37+ messages in thread
From: Rick Jones @ 2012-09-19 17:28 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On 09/19/2012 08:14 AM, Eric Dumazet wrote:
> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
>
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
>
> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> thats a 20 % increase.
I guess Brutus will need a new baseline for his TCP Friends patch then :)
BTW, what is the change, if any for TCP_RR?
happy benchmarking,
rick jones
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-19 17:28 ` Rick Jones
@ 2012-09-19 17:55 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-19 17:55 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, netdev
On Wed, 2012-09-19 at 10:28 -0700, Rick Jones wrote:
> On 09/19/2012 08:14 AM, Eric Dumazet wrote:
> > I did some tests and got no problem so far, even using splice() [ this
> > one was tricky because it only deals with order-0 pages at this moment ]
> >
> > NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
> >
> > On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> > thats a 20 % increase.
>
> I guess Brutus will need a new baseline for his TCP Friends patch then :)
>
> BTW, what is the change, if any for TCP_RR?
>
> happy benchmarking,
>
> rick jones
>
No difference, because I already optimized this case last year ;)
commit f07d960df33c5aef8f513efce0fd201f962f94a1
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon Nov 28 22:41:47 2011 +0000
tcp: avoid frag allocation for small frames
tcp_sendmsg() uses select_size() helper to choose skb head size when a
new skb must be allocated.
If GSO is enabled for the socket, current strategy is to force all
payload data to be outside of headroom, in PAGE fragments.
This strategy is not welcome for small packets, wasting memory.
Experiments show that best results are obtained when using 2048 bytes
for skb head (This includes the skb overhead and various headers)
This patch provides better len/truesize ratios for packets sent to
loopback device, and reduce memory needs for in-flight loopback packets,
particularly on arches with big pages.
If a sender sends many 1-byte packets to an unresponsive application,
receiver rmem_alloc will grow faster and will stop queuing these packets
sooner, or will collapse its receive queue to free excess memory.
netperf -t TCP_RR results are improved by ~4 %, and many workloads are
improved as well (tbench, mysql...)
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-19 15:14 ` Eric Dumazet
2012-09-19 17:28 ` Rick Jones
@ 2012-09-19 17:56 ` David Miller
2012-09-19 19:04 ` Alexander Duyck
2012-09-19 20:18 ` Ben Hutchings
2012-09-19 22:20 ` Vijay Subramanian
3 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-19 17:56 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 19 Sep 2012 17:14:19 +0200
> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
>
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
>
> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> thats a 20 % increase.
That's really a lot more than I expected, nice.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-19 17:56 ` David Miller
@ 2012-09-19 19:04 ` Alexander Duyck
0 siblings, 0 replies; 37+ messages in thread
From: Alexander Duyck @ 2012-09-19 19:04 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On 09/19/2012 10:56 AM, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 19 Sep 2012 17:14:19 +0200
>
>> I did some tests and got no problem so far, even using splice() [ this
>> one was tricky because it only deals with order-0 pages at this moment ]
>>
>> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
>>
>> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
>> thats a 20 % increase.
> That's really a lot more than I expected, nice.
When I get some time I will test this patch on a system with an iommu
enabled. I suspect it will have a huge performance impact there since
now you would be looking at roughly 1/8th the total number of map/unmap
calls on a system with 4K pages.
Thanks,
Alex
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-19 15:14 ` Eric Dumazet
2012-09-19 17:28 ` Rick Jones
2012-09-19 17:56 ` David Miller
@ 2012-09-19 20:18 ` Ben Hutchings
2012-09-19 22:20 ` Vijay Subramanian
3 siblings, 0 replies; 37+ messages in thread
From: Ben Hutchings @ 2012-09-19 20:18 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On Wed, 2012-09-19 at 17:14 +0200, Eric Dumazet wrote:
> On Mon, 2012-09-17 at 13:07 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 17 Sep 2012 19:04:53 +0200
> >
> > > On Mon, 2012-09-17 at 19:02 +0200, Eric Dumazet wrote:
> > >
> > >> A driver already exports a dev->gso_max_size, dev->gso_max_segs, I guess
> > >> it could export a dev->max_seg_order (default to 0)
> > >
> > > Oh well, if we use a per thread order-3 page, a driver wont define an
> > > order, but the max size of a segment (dev->max_seg_size).
> >
> > Since you said that your audit showed that most can handle arbitrary
> > segment sizes, it's better to default to infinity or similar.
> >
> > Otherwise we'll have to annotate almost every single driver with a
> > non-zero value, that's not an efficient way to handle this and
> > deploy the higher performance quickly.
>
> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
>
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
I think sfc would also be fine with this; we split at 4K boundaries
regardless of the host page size.
My only concern is fragmentation on busy machines making high-order
allocations more prone to failure (though this change might well slow
that fragmentation). The larger allocation size should at least be made
dependent on (sk->sk_allocation & GFP_KERNEL) == GFP_KERNEL. (Even
then, I've seen some stress test failures where ring reallocation
(similar size, GFP_KERNEL) fails. But those were done with an older
kernel version and the current mm should do better.)
Ben.
> On loopback, performance of netperf goes from 31900 Mb/s to 38500 Mb/s,
> thats a 20 % increase.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-19 15:14 ` Eric Dumazet
` (2 preceding siblings ...)
2012-09-19 20:18 ` Ben Hutchings
@ 2012-09-19 22:20 ` Vijay Subramanian
2012-09-20 5:37 ` Eric Dumazet
3 siblings, 1 reply; 37+ messages in thread
From: Vijay Subramanian @ 2012-09-19 22:20 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
> I did some tests and got no problem so far, even using splice() [ this
> one was tricky because it only deals with order-0 pages at this moment ]
>
> NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
I applied this patch to net-next and tested with e1000e driver.
With iperf I got around 8 % improvement on loopback.
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Vijay
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-19 22:20 ` Vijay Subramanian
@ 2012-09-20 5:37 ` Eric Dumazet
2012-09-20 17:10 ` Rick Jones
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-20 5:37 UTC (permalink / raw)
To: Vijay Subramanian; +Cc: David Miller, netdev
On Wed, 2012-09-19 at 15:20 -0700, Vijay Subramanian wrote:
> > I did some tests and got no problem so far, even using splice() [ this
> > one was tricky because it only deals with order-0 pages at this moment ]
> >
> > NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
>
>
> I applied this patch to net-next and tested with e1000e driver.
> With iperf I got around 8 % improvement on loopback.
>
> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
>
>
> Vijay
If you keep the producer and consumer on separate cpus, and use large
enough send()s (64KB or 128KB), the gain is more like 15 or 20%.
iperf uses 8KB writes, while netperf uses a 16KB default.
The TCP stack has a problem because the default value (3) of
/proc/sys/net/ipv4/tcp_reordering is too small for loopback, since a
packet contains 4 MSS. A single reorder and some packets are
retransmitted.
The following setting is better:
echo 16 >/proc/sys/net/ipv4/tcp_reordering
loopback is lossless, so it's always surprising we can have TCP
retransmits on this medium ;)
Thanks
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 5:37 ` Eric Dumazet
@ 2012-09-20 17:10 ` Rick Jones
2012-09-20 17:43 ` Eric Dumazet
2012-09-20 19:40 ` David Miller
2012-09-20 21:39 ` Vijay Subramanian
2012-09-20 22:01 ` Rick Jones
2 siblings, 2 replies; 37+ messages in thread
From: Rick Jones @ 2012-09-20 17:10 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, netdev
On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> loopback is lossless, so its always surprising we can have TCP
> retransmits on this medium ;)
Is it lossless?
raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
19 packets pruned from receive queue because of socket buffer overrun
raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_RR -- -b 256 -D -o
burst_size,local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to localhost.localdomain () port 0 AF_INET : nodelay : histogram : demo
: first burst 256
Initial Burst Requests,Local Transport Retransmissions,Remote Transport
Retransmissions
256,151,94
raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
26 packets pruned from receive queue because of socket buffer overrun
raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.38-16-generic #67-Ubuntu SMP Thu Sep 6 17:58:38 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux
Admittedly, my test is on an older kernel, but have things changed in
this regard since then? I had to get a bit more contrived on a later
kernel in a VM (vs what is running directly on my workstation):
raj@tardy-ubuntu-1204:~$ netstat -s | grep -e prune -e retrans
1 segments retransmited
4 packets pruned from receive queue because of socket buffer overrun
1 fast retransmits
raj@tardy-ubuntu-1204:~$ netperf -t TCP_RR -- -b 1024 -D -S 16K -o
local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to localhost () port 0 AF_INET : nodelay : demo : first burst 1024
Local Transport Retransmissions,Remote Transport Retransmissions
1,0
raj@tardy-ubuntu-1204:~$ netstat -s | grep -e prune -e retrans
2 segments retransmited
7 packets pruned from receive queue because of socket buffer overrun
2 fast retransmits
raj@tardy-ubuntu-1204:~$ uname -a
Linux tardy-ubuntu-1204 3.6.0-rc3+ #7 SMP Mon Sep 10 14:46:05 PDT 2012
x86_64 x86_64 x86_64 GNU/Linux
rick
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 17:10 ` Rick Jones
@ 2012-09-20 17:43 ` Eric Dumazet
2012-09-20 18:37 ` Yuchung Cheng
2012-09-20 19:40 ` David Miller
1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-20 17:43 UTC (permalink / raw)
To: Rick Jones; +Cc: Vijay Subramanian, David Miller, netdev
On Thu, 2012-09-20 at 10:10 -0700, Rick Jones wrote:
> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> > loopback is lossless, so its always surprising we can have TCP
> > retransmits on this medium ;)
>
> Is it lossless?
>
It is lossless, yes.
But packets can be dropped by TCP stack for various reasons, including
reordering and retransmits.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 17:43 ` Eric Dumazet
@ 2012-09-20 18:37 ` Yuchung Cheng
0 siblings, 0 replies; 37+ messages in thread
From: Yuchung Cheng @ 2012-09-20 18:37 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Rick Jones, Vijay Subramanian, David Miller, netdev
On Thu, Sep 20, 2012 at 10:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-20 at 10:10 -0700, Rick Jones wrote:
>> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>> > loopback is lossless, so its always surprising we can have TCP
>> > retransmits on this medium ;)
>>
>> Is it lossless?
>>
>
> It is lossless, yes.
>
> But packets can be dropped by TCP stack for various reasons, including
> reordering and retransmits.
I'd recommend checking the reordering stats. If it's lossless, set
tp->reordering = 127 for loopback.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 17:10 ` Rick Jones
2012-09-20 17:43 ` Eric Dumazet
@ 2012-09-20 19:40 ` David Miller
2012-09-20 20:06 ` Rick Jones
1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-20 19:40 UTC (permalink / raw)
To: rick.jones2; +Cc: eric.dumazet, subramanian.vijay, netdev
From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 20 Sep 2012 10:10:43 -0700
> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>> loopback is lossless, so its always surprising we can have TCP
>> retransmits on this medium ;)
>
> Is it lossless?
>
> raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
> 19 packets pruned from receive queue because of socket buffer overrun
Those packets are not being dropped by the loopback device.
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 19:40 ` David Miller
@ 2012-09-20 20:06 ` Rick Jones
2012-09-20 20:25 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: Rick Jones @ 2012-09-20 20:06 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, subramanian.vijay, netdev
On 09/20/2012 12:40 PM, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 20 Sep 2012 10:10:43 -0700
>
>> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>>> loopback is lossless, so its always surprising we can have TCP
>>> retransmits on this medium ;)
>>
>> Is it lossless?
>>
>> raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
>> 19 packets pruned from receive queue because of socket buffer overrun
>
> Those packets are not being dropped by the loopback device.
>
Yes, I was being too fast and loose with my wording, paying more
attention to the netperf tests than the rest of it. While loopback may
be lossless, TCP retransmissions over loopback shouldn't be all *that*
surprising.
rick
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 20:06 ` Rick Jones
@ 2012-09-20 20:25 ` Eric Dumazet
2012-09-21 15:48 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-20 20:25 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, subramanian.vijay, netdev
On Thu, 2012-09-20 at 13:06 -0700, Rick Jones wrote:
>
> Yes, I was being too fast and loose with my wording, paying more
> attention to the netperf tests than the rest of it. While loopback may
> be lossless, TCP retransmissions over loopback shouldn't be all *that*
> surprising.
Sending perfect packets (large packets) should trigger no retransmits.
In your tests, you send one-byte packets, so obviously the receiver will
drop some of them, because the sk_rcvbuf limit (or the backlog limit) is
hit very fast.
(This should be less frequent with the TCP coalescing that was recently
introduced: we are able to coalesce about 1600 one-byte packets into a
single one.)
netperf -t TCP_STREAM over loopback should not drop packets or
retransmit them.
# netstat -s|grep TCPRcvCoalesce
TCPRcvCoalesce: 0
# netperf -t TCP_RR -- -b 1024 -D -S 16K -o local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 () port 0 AF_INET to localhost () port 0 AF_INET : nodelay : first burst 1024
Local Transport Retransmissions,Remote Transport Retransmissions
0,0
# netstat -s|grep TCPRcvCoalesce
TCPRcvCoalesce: 2072191
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 5:37 ` Eric Dumazet
2012-09-20 17:10 ` Rick Jones
@ 2012-09-20 21:39 ` Vijay Subramanian
2012-09-20 22:01 ` Rick Jones
2 siblings, 0 replies; 37+ messages in thread
From: Vijay Subramanian @ 2012-09-20 21:39 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
>>
>> I applied this patch to net-next and tested with e1000e driver.
>> With iperf I got around 8 % improvement on loopback.
>>
>> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
I think this tag should be on the thread with the actual patch. I will
reply to your patch with the Tested-by tag.
Thanks for your tips, Eric. For what it's worth, here is what I found.
>
> If you keep the producer and consumer on separate cpus, and use large
> enough send() (64KB or 128KB), gain is more like 15 or 20%
>
Curiously, when I use taskset to run the iperf server and client on
different cpus, throughput goes down by half for both the baseline
(master branch) and with the patch. Is taskset the right way to test
this?
I did notice a change in absolute throughput when I increased the
send() buffer size. However, both the baseline and the patch showed
improvement, and the relative improvement was still around 8%.
> iperf uses 8KB writes, while netperf uses a 16KB default.
I think iperf has a bug: both the man page and comments in the code
claim the default buffer size for read/write is 8KB, but the actual
default seems to be 128KB (the -l option with iperf).
Thanks !
Vijay
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 5:37 ` Eric Dumazet
2012-09-20 17:10 ` Rick Jones
2012-09-20 21:39 ` Vijay Subramanian
@ 2012-09-20 22:01 ` Rick Jones
2 siblings, 0 replies; 37+ messages in thread
From: Rick Jones @ 2012-09-20 22:01 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, netdev
On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> iperf uses 8KB writes, while netperf uses a 16KB default.
For the sake of the archives and posterity, netperf does not have a
"fixed" default send size. The "default" will vary with platform and
platform tuning.
What netperf does (for TCP at least) is default the send size to the
value returned after a getsockopt(SO_SNDBUF) issued against the socket
just after it is allocated for the data connection. If the user has
asked for a specific socket buffer size, there will have been a
preceding setsockopt(SO_SNDBUF) call.
So, "by default" under Linux, with no options to set the socket buffer
size, netperf will use 16 KB so long as that is the default (initial)
value for SO_SNDBUF.
The sequence will go something like:
1) create the data socket
2) if user asked to set socket buffer size call setsockopt()
3) call getsockopt()
4) if the user did not specify a send size, use the value returned from
the getsockopt() call
So, if one runs netperf on a platform other than Linux, the "default"
send size may be different. Similarly, if running under Linux but
net.ipv4.tcp_wmem is tweaked, the "default" send size may be different.
happy benchmarking,
rick jones
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-20 20:25 ` Eric Dumazet
@ 2012-09-21 15:48 ` Eric Dumazet
2012-09-21 16:27 ` David Miller
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-21 15:48 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, subramanian.vijay, netdev
On Thu, 2012-09-20 at 22:25 +0200, Eric Dumazet wrote:
> On Thu, 2012-09-20 at 13:06 -0700, Rick Jones wrote:
>
> >
> > Yes, I was being too fast and loose with my wording, paying more
> > attention to the netperf tests than the rest of it. While loopback may
> > be lossless, TCP retransmissions over loopback shouldn't be all *that*
> > surprising.
>
> Sending perfect packets (large packets) should trigger no retransmits.
By the way, with the current MTU of 16436 on loopback, the max packet
size is 48KB (3 MSS).
Using an MTU of 65536 allows another 25% increase in bulk performance...
(and fewer potential reordering effects, as a packet then contains one
MSS instead of three)
There is probably a reason why lo default MTU is 16436 ?
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-21 15:48 ` Eric Dumazet
@ 2012-09-21 16:27 ` David Miller
2012-09-21 16:51 ` Eric Dumazet
2012-09-23 12:47 ` Jan Engelhardt
0 siblings, 2 replies; 37+ messages in thread
From: David Miller @ 2012-09-21 16:27 UTC (permalink / raw)
To: eric.dumazet; +Cc: rick.jones2, subramanian.vijay, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Sep 2012 17:48:31 +0200
> There is probably a reason why lo default MTU is 16436 ?
That's what fit into L1 caches back in 1999
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-21 16:27 ` David Miller
@ 2012-09-21 16:51 ` Eric Dumazet
2012-09-21 17:04 ` David Miller
2012-09-23 12:47 ` Jan Engelhardt
1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-09-21 16:51 UTC (permalink / raw)
To: David Miller; +Cc: rick.jones2, subramanian.vijay, netdev
On Fri, 2012-09-21 at 12:27 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 21 Sep 2012 17:48:31 +0200
>
> > There is probably a reason why lo default MTU is 16436 ?
>
> That's what fit into L1 caches back in 1999
I see ;)
Nowadays, we even have the NETIF_F_NOCACHE_COPY flag and
__copy_from_user_nocache().
Hmm, we cannot toggle this flag on loopback yet:
# ethtool -K lo tx-nocache-copy on
Could not change any device features
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-21 16:51 ` Eric Dumazet
@ 2012-09-21 17:04 ` David Miller
2012-09-21 17:11 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-21 17:04 UTC (permalink / raw)
To: eric.dumazet; +Cc: rick.jones2, subramanian.vijay, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Sep 2012 18:51:21 +0200
> On Fri, 2012-09-21 at 12:27 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Fri, 21 Sep 2012 17:48:31 +0200
>>
>> > There is probably a reason why lo default MTU is 16436 ?
>>
>> That's what fit into L1 caches back in 1999
>
> I see ;)
>
> Nowadays, we even have the NETIF_F_NOCACHE_COPY flag and
> __copy_from_user_nocache()
>
> Hmm, we can not toggle this flag on loopback yet
>
> # ethtool -K lo tx-nocache-copy on
> Could not change any device features
It's a silly limitation, in net/core/dev.c:
	/* Turn on no cache copy if HW is doing checksum */
	if (!(dev->flags & IFF_LOOPBACK)) {
		dev->hw_features |= NETIF_F_NOCACHE_COPY;
		if (dev->features & NETIF_F_ALL_CSUM) {
			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
			dev->features |= NETIF_F_NOCACHE_COPY;
		}
	}
Maybe this is better done as:
	/* Turn on no cache copy if HW is doing checksum */
	dev->hw_features |= NETIF_F_NOCACHE_COPY;
	if (!(dev->flags & IFF_LOOPBACK)) {
		if (dev->features & NETIF_F_ALL_CSUM) {
			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
			dev->features |= NETIF_F_NOCACHE_COPY;
		}
	}
And then the code matches more closely the comment. :-)
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-21 17:04 ` David Miller
@ 2012-09-21 17:11 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-21 17:11 UTC (permalink / raw)
To: David Miller; +Cc: rick.jones2, subramanian.vijay, netdev
On Fri, 2012-09-21 at 13:04 -0400, David Miller wrote:
> It's a silly limitation, in net/core/dev.c:
>
> 	/* Turn on no cache copy if HW is doing checksum */
> 	if (!(dev->flags & IFF_LOOPBACK)) {
> 		dev->hw_features |= NETIF_F_NOCACHE_COPY;
> 		if (dev->features & NETIF_F_ALL_CSUM) {
> 			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
> 			dev->features |= NETIF_F_NOCACHE_COPY;
> 		}
> 	}
>
> Maybe this is better done as:
>
> /* Turn on no cache copy if HW is doing checksum */
> dev->hw_features |= NETIF_F_NOCACHE_COPY;
> if (!(dev->flags & IFF_LOOPBACK)) {
> if (dev->features & NETIF_F_ALL_CSUM) {
> dev->wanted_features |= NETIF_F_NOCACHE_COPY;
> dev->features |= NETIF_F_NOCACHE_COPY;
> }
> }
>
> And then the code matches more closely the comment. :-)
I did a test with various combinations (producer/consumer on the same
core or not, same CPU or not...), and performance is divided by 2.
So I guess we can leave the code as is
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-21 16:27 ` David Miller
2012-09-21 16:51 ` Eric Dumazet
@ 2012-09-23 12:47 ` Jan Engelhardt
2012-09-23 16:16 ` David Miller
1 sibling, 1 reply; 37+ messages in thread
From: Jan Engelhardt @ 2012-09-23 12:47 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev
On Friday 2012-09-21 18:27, David Miller wrote:
>From: Eric Dumazet <eric.dumazet@gmail.com>
>Date: Fri, 21 Sep 2012 17:48:31 +0200
>
>> There is probably a reason why lo default MTU is 16436 ?
>
>That's what fit into L1 caches back in 1999
Would it make sense to automatically set lo's MTU on device registration
to the actual size of the L1 that the $running system has?
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-23 12:47 ` Jan Engelhardt
@ 2012-09-23 16:16 ` David Miller
2012-09-23 17:40 ` Jan Engelhardt
0 siblings, 1 reply; 37+ messages in thread
From: David Miller @ 2012-09-23 16:16 UTC (permalink / raw)
To: jengelh; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev
From: Jan Engelhardt <jengelh@inai.de>
Date: Sun, 23 Sep 2012 14:47:28 +0200 (CEST)
> On Friday 2012-09-21 18:27, David Miller wrote:
>
>>From: Eric Dumazet <eric.dumazet@gmail.com>
>>Date: Fri, 21 Sep 2012 17:48:31 +0200
>>
>>> There is probably a reason why lo default MTU is 16436 ?
>>
>>That's what fit into L1 caches back in 1999
>
> Would it make sense to automatically set lo's MTU on device
> registration to the actual size of the L1 that the $running system
> has?
I think a fixed value of 64K would be easiest, because another issue
is that back in 1999 we didn't have GRO/GSO/etc.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-23 16:16 ` David Miller
@ 2012-09-23 17:40 ` Jan Engelhardt
2012-09-23 18:13 ` Eric Dumazet
2012-09-23 18:27 ` David Miller
0 siblings, 2 replies; 37+ messages in thread
From: Jan Engelhardt @ 2012-09-23 17:40 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev
On Sunday 2012-09-23 18:16, David Miller wrote:
>> On Friday 2012-09-21 18:27, David Miller wrote:
>>
>>>From: Eric Dumazet <eric.dumazet@gmail.com>
>>>Date: Fri, 21 Sep 2012 17:48:31 +0200
>>>
>>>> There is probably a reason why lo default MTU is 16436 ?
>>>
>>>That's what fit into L1 caches back in 1999
>>
>> Would it make sense to automatically set lo's MTU on device
>> registration to the actual size of the L1 that the $running system
>> has?
>
>I think a fixed value of 64K would be easiest, because another issue
>is that back in 1999 we didn't have GRO/GSO/etc.
Cache sizes, and an oddity.
L1 cache sizes have not increased since then (the 2011 Intel i7-2600
has the same amount of L1 as a 1998ish AMD K6-2), and the Atom N450
even has less, namely 24d+32i, meaning a 13000ish MTU might be more
accurate for netbooks of this kind.
What would a MTU of 64K buy? Offloading seems pointless for the lo
device. There is no hardware to offload it to - well, the
machine already has the packet too.
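For what it's worth, the numbers in this thread line up if you read the
16436 default as half of a 32 KiB L1d plus 52 bytes of IPv4 +
TCP-with-timestamps header overhead; that is only one reading of where
the constant comes from, so treat the sketch below as a hypothetical
back-of-the-envelope, not the actual 1999 derivation:

```python
# Hypothetical sketch: derive a loopback MTU from the L1 data cache
# size, assuming the historical lo MTU of 16436 was picked as half of
# a 32 KiB L1d plus 52 bytes of header overhead (20-byte IPv4 header
# + 32-byte TCP header with the timestamp option).

HDR = 20 + 32  # IPv4 header + TCP header with timestamp option

def lo_mtu_for_l1d(l1d_bytes):
    """Payload sized to half the L1d cache, plus protocol headers."""
    return l1d_bytes // 2 + HDR

print(lo_mtu_for_l1d(32 * 1024))  # 16436 -- K6-2 / i7-2600 class L1d
print(lo_mtu_for_l1d(24 * 1024))  # 12340 -- Atom N450, the "13000ish"
```

On Linux the L1d size is typically readable from
/sys/devices/system/cpu/cpu0/cache/index0/size, which is presumably
where a registration-time computation would get it.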
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-23 17:40 ` Jan Engelhardt
@ 2012-09-23 18:13 ` Eric Dumazet
2012-09-23 18:27 ` David Miller
1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-09-23 18:13 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: David Miller, rick.jones2, subramanian.vijay, netdev
On Sun, 2012-09-23 at 19:40 +0200, Jan Engelhardt wrote:
>
> Cache sizes, and an oddity.
>
> L1 cache sizes have not increased ever since (the 2011 Intel i7-2600
> has the same amount of L1 as a 1998ish AMD K6-2), and the Atom N450
> even has less, namely 24d+32i, meaning a 13000ish MTU might be more
> accurate for netbooks of this kind.
>
An MTU above 64K is not really useful, at least for IPv4, I think.
> What would a MTU of 64K buy? Offloading seems pointless for the lo
> device. There is no hardware to offload it to - well, the
> machine already has the packet too.
It buys a 25% increase in performance. Not too bad...
I have yet to make sure no adjustment is needed, in the TCP stack for
example.
If you try to change the lo MTU on an old kernel (e.g. 2.6.38, on my
laptop), you can notice some packet drops (TCPBacklogDrop)
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-23 17:40 ` Jan Engelhardt
2012-09-23 18:13 ` Eric Dumazet
@ 2012-09-23 18:27 ` David Miller
1 sibling, 0 replies; 37+ messages in thread
From: David Miller @ 2012-09-23 18:27 UTC (permalink / raw)
To: jengelh; +Cc: eric.dumazet, rick.jones2, subramanian.vijay, netdev
From: Jan Engelhardt <jengelh@inai.de>
Date: Sun, 23 Sep 2012 19:40:33 +0200 (CEST)
> What would a MTU of 64K buy? Offloading seems pointless for the lo
> device. There is no hardware to offload it to - well, the
> machine already has the packet too.
Traversing the stack once instead of N times.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-09-17 7:49 [RFC] tcp: use order-3 pages in tcp_sendmsg() Eric Dumazet
2012-09-17 16:12 ` David Miller
@ 2012-11-15 7:52 ` Yan, Zheng
2012-11-15 13:06 ` Eric Dumazet
` (2 more replies)
1 sibling, 3 replies; 37+ messages in thread
From: Yan, Zheng @ 2012-11-15 7:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> We currently use per socket page reserve for tcp_sendmsg() operations.
>
> Its done to raise the probability of coalescing small write() in to
> single segments in the skbs.
>
> But it wastes a lot of memory for applications handling a lot of mostly
> idle sockets, since each socket holds one page in sk->sk_sndmsg_page
>
> I did a small experiment to use order-3 pages and it gave me a 10% boost
> of performance, because each TSO skb can use only two frags of 32KB,
> instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit() to
> setup the tx descriptor and TX completion path to unmap the frags and
> free them.
>
> We also spend less time in tcp_sendmsg(), because we call page allocator
> 8x less often.
>
> Now back to the per socket page, what about trying to factorize it ?
>
> Since we can sleep (or/and do a cpu migration) in tcp_sendmsg(), we cant
> really use a percpu page reserve as we do in __netdev_alloc_frag()
>
> We could instead use a per thread reserve, at the cost of adding a test
> in task exit handler.
>
> Recap :
>
> 1) Use a per thread page reserve instead of a per socket one
> 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
>
>
Hi,
This commit makes one of our test cases on a core 2 machine drop in
performance by about 60%. The test case runs 2048 instances of the
netperf 64k stream test at the same time. Analysis showed that using
order-3 pages causes more LLC misses; most new LLC misses happen when
the senders copy data to the socket buffer. If we revert to using a
single page, the sender side only triggers a few LLC misses, and most
LLC misses happen on the receiver side. It means most pages allocated
by the senders are cache hot. But when using order-3 pages, 2048 * 32k
= 64M, and 64M is much larger than the LLC size. Should we worry about
this regression? Or is our test case too impractical?
Regards
Yan, Zheng
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-11-15 7:52 ` Yan, Zheng
@ 2012-11-15 13:06 ` Eric Dumazet
2012-11-16 2:36 ` Yan, Zheng
2012-11-15 13:47 ` Eric Dumazet
2012-11-15 18:33 ` Rick Jones
2 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-11-15 13:06 UTC (permalink / raw)
To: Yan, Zheng; +Cc: netdev
On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
> On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > We currently use per socket page reserve for tcp_sendmsg() operations.
> >
> > Its done to raise the probability of coalescing small write() in to
> > single segments in the skbs.
> >
> > But it wastes a lot of memory for applications handling a lot of mostly
> > idle sockets, since each socket holds one page in sk->sk_sndmsg_page
> >
> > I did a small experiment to use order-3 pages and it gave me a 10% boost
> > of performance, because each TSO skb can use only two frags of 32KB,
> > instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit() to
> > setup the tx descriptor and TX completion path to unmap the frags and
> > free them.
> >
> > We also spend less time in tcp_sendmsg(), because we call page allocator
> > 8x less often.
> >
> > Now back to the per socket page, what about trying to factorize it ?
> >
> > Since we can sleep (or/and do a cpu migration) in tcp_sendmsg(), we cant
> > really use a percpu page reserve as we do in __netdev_alloc_frag()
> >
> > We could instead use a per thread reserve, at the cost of adding a test
> > in task exit handler.
> >
> > Recap :
> >
> > 1) Use a per thread page reserve instead of a per socket one
> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
> >
> >
>
> Hi,
>
> This commit makes one of our test case on core 2 machine drop in performance
> by about 60%. The test case runs 2048 instances of netperf 64k stream test at
> the same time. Analysis showed using order-3 pages causes more LLC misses,
> most new LLC misses happen when the senders copy data to the socket buffer.
> If revert to use single page, the sender side only trigger a few LLC
> misses, most
> LLC misses happen on the receiver size. It means most pages allocated by the
> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
> is much larger than LLC size. Should this regression be worried? or
> our test case
> is too unpractical?
Hi Yan
You forgot to give some basic information with this mail, like the
hardware configuration, NIC driver, ...
Increasing performance can sometimes change the balance you had on a
prior workload.
The number of in-flight bytes does not depend on the order of the
pages, but on the sizes of the TCP buffers (receiver, sender)
TCP Small Queues was an attempt to reduce the number of in-flight
bytes; you should try to change either the SO_SNDBUF or SO_RCVBUF
settings (instead of letting the system autotune them) if you really
need 2048 concurrent flows.
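For reference, pinning the buffers rather than autotuning looks like
the following minimal Python sketch (the 64 KiB values are purely
illustrative, not a recommendation for this workload):

```python
import socket

# Minimal sketch: pin send/receive buffer sizes instead of letting
# TCP autotuning grow them.
SNDBUF = 64 * 1024
RCVBUF = 64 * 1024

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, SNDBUF)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RCVBUF)

# Linux doubles the requested value to leave room for bookkeeping
# overhead, so getsockopt() reports twice what was set.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))  # e.g. 131072 on Linux
s.close()
```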
Otherwise, each flow can consume up to 6 MB of memory, so obviously
your CPU caches won't hold 2048*6MB of memory...
If the sender is faster (because of this commit), but receiver is slow
to drain the receive queues, then you can have a situation where the
consumed memory on receiver is higher and the receiver might be actually
slower.
Thanks
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-11-15 7:52 ` Yan, Zheng
2012-11-15 13:06 ` Eric Dumazet
@ 2012-11-15 13:47 ` Eric Dumazet
2012-11-21 8:05 ` Yan, Zheng
2012-11-15 18:33 ` Rick Jones
2 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-11-15 13:47 UTC (permalink / raw)
To: Yan, Zheng; +Cc: netdev
On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
> LLC misses happen on the receiver size. It means most pages allocated by the
> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
> is much larger than LLC size.
By the way, this 2048*32k is wrong, as the receiver only uses fragments
in pages, and it's not related to the "tcp: use order-3 pages in
tcp_sendmsg()" commit.
Many drivers use order-0 pages to hold ethernet frames, regardless of
what was used by the sender ;)
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-11-15 7:52 ` Yan, Zheng
2012-11-15 13:06 ` Eric Dumazet
2012-11-15 13:47 ` Eric Dumazet
@ 2012-11-15 18:33 ` Rick Jones
2 siblings, 0 replies; 37+ messages in thread
From: Rick Jones @ 2012-11-15 18:33 UTC (permalink / raw)
To: Yan, Zheng; +Cc: Eric Dumazet, netdev
On 11/14/2012 11:52 PM, Yan, Zheng wrote:
> This commit makes one of our test case on core 2 machine drop in
> performance by about 60%. The test case runs 2048 instances of
> netperf 64k stream test at the same time.
I'm impressed that 2048 concurrent netperf TCP_STREAM tests ran to
completion in the first place :)
> Analysis showed using order-3 pages causes more LLC misses, most new
> LLC misses happen when the senders copy data to the socket buffer.
> If revert to use single page, the sender side only trigger a few LLC
> misses, most LLC misses happen on the receiver size. It means most
> pages allocated by the senders are cache hot. But when using order-3
> pages, 2048 * 32k = 64M, 64M is much larger than LLC size. Should
> this regression be worried? or our test case is too unpractical?
Even before the page change I would have expected the buffers that
netperf itself uses would have exceeded the LLC. If you were not using
test-specific -s and -S options to set an explicit socket buffer size, I
believe that under Linux (most of the time) the default SO_SNDBUF size
will be 86KB. Coupled with your statement that the send size was 64K it
means the send ring being used by netperf will be two 64KB buffers, which
would then be 256MB across 2048 concurrent netperfs. Even if we go with
"only the one send buffer in play at a time matters" that is still 128
MB of space up in netperf itself even before one gets to the stack.
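The arithmetic above, spelled out as a back-of-the-envelope check
using the figures from this thread:

```python
# Working-set estimate for netperf's own send buffers, before any
# stack memory is counted.
instances = 2048
send_size = 64 * 1024  # netperf -m 64K
ring_depth = 2         # send buffers in the ring with an ~86 KB SO_SNDBUF

whole_ring = instances * ring_depth * send_size
one_hot = instances * send_size  # only one buffer per ring in play

print(whole_ring // 2**20, "MiB")  # 256 MiB across all send rings
print(one_hot // 2**20, "MiB")     # 128 MiB if one buffer per ring is hot
```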
Still, sharing the analysis tool output might be helpful.
By the way, the "default" size of the buffer netperf posts in recv()
calls will depend on the initial value of SO_RCVBUF after the data
socket is created and has had any -s or -S option values applied to it.
I cannot say that the scripts distributed with netperf are consistently
good about doing it themselves, but I would suggest for the "canonical"
bulk stream test something like:
netperf -t TCP_STREAM -H <dest> -l 60 -- -s 1M -S 1M -m 64K -M 64K
as that will reduce the number of variables. Those -s and -S values
though will probably call for tweaking sysctl settings or they will be
clipped by net.core.rmem_max and net.core.wmem_max. At a minimum I
would suggest having the -m and -M options. I might also tack on a "-o
all" at the end, but that is a matter of preference - it will cause a
great deal of output...
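Concretely, raising the caps before running with -s 1M / -S 1M might
look like this (run as root on both endpoints; the 1 MiB figure simply
matches the options above):

```shell
# Raise the socket buffer caps so explicit 1 MiB requests are not
# clipped; these settings do not persist across a reboot.
sysctl -w net.core.rmem_max=1048576
sysctl -w net.core.wmem_max=1048576
```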
Eric Dumazet later says:
> Number of in flight bytes do not depend on the order of the pages, but
> sizes of TCP buffers (receiver, sender)
And unless you happened to use explicit -s and -S options, there is even
more variability in how much may be inflight. If you do not add those
you can at least get netperf to report what the socket buffer sizes
became by the end of the test:
netperf -t TCP_STREAM ... -- ... -o lss_size_end,rsr_size_end
for "local socket send size" and "remote socket receive size" respectively.
> If the sender is faster (because of this commit), but receiver is slow
> to drain the receive queues, then you can have a situation where the
> consumed memory on receiver is higher and the receiver might be actually
> slower.
Netperf can be told to report the number of receive calls and the bytes
per receive - either by tacking-on a global "-v 2" or by requesting them
explicitly via omni output selection. Presumably, if the receiving
netserver processes are not keeping up as well, that should manifest as
the bytes per receive being larger in the "after" case than the "before"
case.
netperf ... -- ... -o
remote_recv_size,remote_recv_calls,remote_bytes_per_recv
happy benchmarking,
rick jones
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-11-15 13:06 ` Eric Dumazet
@ 2012-11-16 2:36 ` Yan, Zheng
0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2012-11-16 2:36 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, rick.jones2, Yan, Zheng
On Thu, Nov 15, 2012 at 9:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
>> On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > We currently use per socket page reserve for tcp_sendmsg() operations.
>> >
>> > Its done to raise the probability of coalescing small write() in to
>> > single segments in the skbs.
>> >
>> > But it wastes a lot of memory for applications handling a lot of mostly
>> > idle sockets, since each socket holds one page in sk->sk_sndmsg_page
>> >
>> > I did a small experiment to use order-3 pages and it gave me a 10% boost
>> > of performance, because each TSO skb can use only two frags of 32KB,
>> > instead of 16 frags of 4KB, so we spend less time in ndo_start_xmit() to
>> > setup the tx descriptor and TX completion path to unmap the frags and
>> > free them.
>> >
>> > We also spend less time in tcp_sendmsg(), because we call page allocator
>> > 8x less often.
>> >
>> > Now back to the per socket page, what about trying to factorize it ?
>> >
>> > Since we can sleep (or/and do a cpu migration) in tcp_sendmsg(), we cant
>> > really use a percpu page reserve as we do in __netdev_alloc_frag()
>> >
>> > We could instead use a per thread reserve, at the cost of adding a test
>> > in task exit handler.
>> >
>> > Recap :
>> >
>> > 1) Use a per thread page reserve instead of a per socket one
>> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
>> >
>> >
>>
>> Hi,
>>
>> This commit makes one of our test case on core 2 machine drop in performance
>> by about 60%. The test case runs 2048 instances of netperf 64k stream test at
>> the same time. Analysis showed using order-3 pages causes more LLC misses,
>> most new LLC misses happen when the senders copy data to the socket buffer.
>> If revert to use single page, the sender side only trigger a few LLC
>> misses, most
>> LLC misses happen on the receiver size. It means most pages allocated by the
>> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
>> is much larger than LLC size. Should this regression be worried? or
>> our test case
>> is too unpractical?
>
> Hi Yan
>
> You forgot to give some basic information with this mail, like the
> hardware configuration, NIC driver, ...
>
> Increasing performance can sometime change the balance you had on a
> prior workload.
>
> Number of in flight bytes do not depend on the order of the pages, but
> sizes of TCP buffers (receiver, sender)
>
> TCP Small queue was an attempt to reduce the number of in-flight bytes,
> you should try to change either SO_SNDBUF or SO_RCVBUF settings (instead
> of letting the system autotune them) if you really need 2048 concurrent
> flows.
>
> Otherwise, each flow can consume up to 6 MB of memory, so obviously your
> cpu caches wont hold 2048*6MB of memory...
>
> If the sender is faster (because of this commit), but receiver is slow
> to drain the receive queues, then you can have a situation where the
> consumed memory on receiver is higher and the receiver might be actually
> slower.
>
I'm sorry, I forgot to mention that the test ran on the loopback
device. It's one test case in our kernel performance test project.
This test case is very sensitive to memory allocation and scheduler
behavior changes.
Regards
Yan, Zheng
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
2012-11-15 13:47 ` Eric Dumazet
@ 2012-11-21 8:05 ` Yan, Zheng
0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2012-11-21 8:05 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Thu, Nov 15, 2012 at 9:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
>> LLC misses happen on the receiver size. It means most pages allocated by the
>> senders are cache hot. But when using order-3 pages, 2048 * 32k = 64M, 64M
>> is much larger than LLC size.
>
> By the way, this 2048*32k is wrong, as the receiver only uses fragments
> in pages, and its not related to the "tcp: use order-3 pages in
> tcp_sendmsg()" commit
>
> Many drivers use order-0 pages to hold ethernet frames, regardless of
> what was used by the sender ;)
>
>
Hi,
I think we found the root cause of this regression. The test case runs
2048 instances of the netperf TCP loopback stream test on a two-socket
core2 machine. There are more LLC misses when using order-3 pages.
core2 is not a NUMA architecture; there is only one memory node.
Order-3 pages used by one socket may later be re-used by another
socket, which causes lots of LLC invalidations. Using order-0 pages
doesn't have this issue because the kernel page allocator uses per-cpu
lists to optimize order-0 page allocation.
Regards
Yan, Zheng
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2012-11-21 8:05 UTC | newest]
Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-17 7:49 [RFC] tcp: use order-3 pages in tcp_sendmsg() Eric Dumazet
2012-09-17 16:12 ` David Miller
2012-09-17 17:02 ` Eric Dumazet
2012-09-17 17:04 ` Eric Dumazet
2012-09-17 17:07 ` David Miller
2012-09-19 15:14 ` Eric Dumazet
2012-09-19 17:28 ` Rick Jones
2012-09-19 17:55 ` Eric Dumazet
2012-09-19 17:56 ` David Miller
2012-09-19 19:04 ` Alexander Duyck
2012-09-19 20:18 ` Ben Hutchings
2012-09-19 22:20 ` Vijay Subramanian
2012-09-20 5:37 ` Eric Dumazet
2012-09-20 17:10 ` Rick Jones
2012-09-20 17:43 ` Eric Dumazet
2012-09-20 18:37 ` Yuchung Cheng
2012-09-20 19:40 ` David Miller
2012-09-20 20:06 ` Rick Jones
2012-09-20 20:25 ` Eric Dumazet
2012-09-21 15:48 ` Eric Dumazet
2012-09-21 16:27 ` David Miller
2012-09-21 16:51 ` Eric Dumazet
2012-09-21 17:04 ` David Miller
2012-09-21 17:11 ` Eric Dumazet
2012-09-23 12:47 ` Jan Engelhardt
2012-09-23 16:16 ` David Miller
2012-09-23 17:40 ` Jan Engelhardt
2012-09-23 18:13 ` Eric Dumazet
2012-09-23 18:27 ` David Miller
2012-09-20 21:39 ` Vijay Subramanian
2012-09-20 22:01 ` Rick Jones
2012-11-15 7:52 ` Yan, Zheng
2012-11-15 13:06 ` Eric Dumazet
2012-11-16 2:36 ` Yan, Zheng
2012-11-15 13:47 ` Eric Dumazet
2012-11-21 8:05 ` Yan, Zheng
2012-11-15 18:33 ` Rick Jones