* [PATCH net-next 0/3] v3 GRE: TCP segmentation offload
@ 2013-02-14 19:44 Pravin B Shelar
2013-02-15 20:18 ` David Miller
0 siblings, 1 reply; 6+ messages in thread
From: Pravin B Shelar @ 2013-02-14 19:44 UTC (permalink / raw)
To: netdev; +Cc: edumazet, jesse, bhutchings, mirqus, Pravin B Shelar
The following patches add TCP segmentation offload to GRE. These
patches show a 20-25% performance improvement in a single-process
netperf TCP_STREAM test on a 10G network.
Pravin B Shelar (3):
  net: Add skb_unclone() helper function.
  net: factor out skb_mac_gso_segment() from skb_gso_segment()
  GRE: Add TCP segmentation offload for GRE

 drivers/net/ppp/ppp_generic.c           |   3 +-
 include/linux/netdev_features.h         |   3 +-
 include/linux/netdevice.h               |   2 +
 include/linux/skbuff.h                  |  27 +++++++
 net/core/dev.c                          |  80 ++++++++++++--------
 net/core/ethtool.c                      |   1 +
 net/core/skbuff.c                       |   6 +-
 net/ipv4/af_inet.c                      |   1 +
 net/ipv4/ah4.c                          |   3 +-
 net/ipv4/gre.c                          | 122 +++++++++++++++++++++++++++++++
 net/ipv4/ip_fragment.c                  |   2 +-
 net/ipv4/ip_gre.c                       |  82 +++++++++++++++++++--
 net/ipv4/tcp.c                          |   1 +
 net/ipv4/tcp_output.c                   |   2 +-
 net/ipv4/udp.c                          |   3 +-
 net/ipv4/xfrm4_input.c                  |   2 +-
 net/ipv4/xfrm4_mode_tunnel.c            |   3 +-
 net/ipv6/ah6.c                          |   3 +-
 net/ipv6/ip6_offload.c                  |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   2 +-
 net/ipv6/reassembly.c                   |   2 +-
 net/ipv6/udp_offload.c                  |   3 +-
 net/ipv6/xfrm6_mode_tunnel.c            |   3 +-
 net/sched/act_ipt.c                     |   6 +-
 net/sched/act_pedit.c                   |   3 +-
 25 files changed, 303 insertions(+), 63 deletions(-)
* Re: [PATCH net-next 0/3] v3 GRE: TCP segmentation offload
2013-02-14 19:44 [PATCH net-next 0/3] v3 GRE: TCP segmentation offload Pravin B Shelar
@ 2013-02-15 20:18 ` David Miller
2013-02-16 0:52 ` Eric Dumazet
0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2013-02-15 20:18 UTC (permalink / raw)
To: pshelar; +Cc: netdev, edumazet, jesse, bhutchings, mirqus
From: Pravin B Shelar <pshelar@nicira.com>
Date: Thu, 14 Feb 2013 11:44:41 -0800
> The following patches add TCP segmentation offload to GRE. These
> patches show a 20-25% performance improvement in a single-process
> netperf TCP_STREAM test on a 10G network.
>
> Pravin B Shelar (3):
> net: Add skb_unclone() helper function.
> net: factor out skb_mac_gso_segment() from skb_gso_segment()
> GRE: Add TCP segmentation offload for GRE
All applied, incorporating the suggestions/fixes from Eric. Specifically,
using skb_reset_mac_len() in patch #2 and computing pkt_len before ip_local_out()
in patch #3.
Thanks.
* Re: [PATCH net-next 0/3] v3 GRE: TCP segmentation offload
2013-02-15 20:18 ` David Miller
@ 2013-02-16 0:52 ` Eric Dumazet
2013-02-16 1:41 ` Pravin Shelar
2013-02-16 1:53 ` David Miller
0 siblings, 2 replies; 6+ messages in thread
From: Eric Dumazet @ 2013-02-16 0:52 UTC (permalink / raw)
To: David Miller; +Cc: pshelar, netdev, edumazet, jesse, bhutchings, mirqus
On Fri, 2013-02-15 at 15:18 -0500, David Miller wrote:
> All applied, incorporating the suggestions/fixes from Eric. Specifically,
> using skb_reset_mac_len() in patch #2 and computing pkt_len before ip_local_out()
> in patch #3.
Thanks David
There is this "tx-nocache-copy" issue :
We currently enable the nocache copy for all devices but loopback.
But it's a performance loss with tunnel devices.
Actually, it seems a loss even for regular ethernet devices :(
# ethtool -K gre1 tx-nocache-copy on
# perf stat netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    4252.42
Performance counter stats for 'netperf -H 7.7.8.84':
9967.965824 task-clock # 0.996 CPUs utilized
54 context-switches # 0.005 K/sec
3 CPU-migrations # 0.000 K/sec
261 page-faults # 0.026 K/sec
27,964,187,393 cycles # 2.805 GHz
20,902,040,632 stalled-cycles-frontend # 74.75% frontend cycles idle
13,524,565,776 stalled-cycles-backend # 48.36% backend cycles idle
15,929,463,578 instructions # 0.57 insns per cycle
# 1.31 stalled cycles per insn
2,065,830,063 branches # 207.247 M/sec
35,891,035 branch-misses # 1.74% of all branches
10.003882959 seconds time elapsed
Now we use regular memory copy :
# ethtool -K gre1 tx-nocache-copy off
# perf stat netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    7706.50
Performance counter stats for 'netperf -H 7.7.8.84':
5708.284991 task-clock # 0.571 CPUs utilized
5,138 context-switches # 0.900 K/sec
24 CPU-migrations # 0.004 K/sec
260 page-faults # 0.046 K/sec
15,990,404,388 cycles # 2.801 GHz
10,903,764,099 stalled-cycles-frontend # 68.19% frontend cycles idle
6,089,332,139 stalled-cycles-backend # 38.08% backend cycles idle
10,680,845,426 instructions # 0.67 insns per cycle
# 1.02 stalled cycles per insn
1,401,663,288 branches # 245.549 M/sec
15,380,428 branch-misses # 1.10% of all branches
10.004025020 seconds time elapsed
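To put the two runs side by side, here is a small sketch that derives the throughput ratio and an approximate cycles-per-byte figure from the numbers above. It is only an illustration: the `runs` dictionary and the helper names are made up for this example, the byte count is estimated from netperf's reported throughput and its 10.00 s elapsed time, and the cycle counts cover only what `perf stat` attributed to the netperf process.

```python
# Figures copied from the two `perf stat netperf -H 7.7.8.84` runs above.
# The structure and names here are illustrative, not part of any tool.
runs = {
    "tx-nocache-copy on": {
        "mbit_per_s": 4252.42,        # netperf throughput, 10^6 bits/sec
        "cycles": 27_964_187_393,     # perf stat "cycles"
        "elapsed_s": 10.00,           # netperf elapsed time
    },
    "tx-nocache-copy off": {
        "mbit_per_s": 7706.50,
        "cycles": 15_990_404_388,
        "elapsed_s": 10.00,
    },
}


def bytes_sent(run):
    """Estimate payload bytes from throughput (10^6 bits/sec) and elapsed time."""
    return run["mbit_per_s"] * 1e6 / 8 * run["elapsed_s"]


def cycles_per_byte(run):
    """Cycles counted for the netperf process per estimated byte sent."""
    return run["cycles"] / bytes_sent(run)


on = runs["tx-nocache-copy on"]
off = runs["tx-nocache-copy off"]
speedup = off["mbit_per_s"] / on["mbit_per_s"]

for name, run in runs.items():
    print(f"{name}: {run['mbit_per_s']:.2f} Mbit/s, "
          f"{cycles_per_byte(run):.2f} cycles/byte")
print(f"throughput ratio (off / on): {speedup:.2f}x")
```

On these figures the regular copy delivers roughly 1.8 times the throughput while spending fewer cycles in the netperf process, which is the loss described above. This is a single pair of runs on one machine, so the ratio should not be read as general.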
* Re: [PATCH net-next 0/3] v3 GRE: TCP segmentation offload
2013-02-16 0:52 ` Eric Dumazet
@ 2013-02-16 1:41 ` Pravin Shelar
2013-02-16 1:43 ` Eric Dumazet
2013-02-16 1:53 ` David Miller
1 sibling, 1 reply; 6+ messages in thread
From: Pravin Shelar @ 2013-02-16 1:41 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, edumazet, jesse, bhutchings, mirqus
On Fri, Feb 15, 2013 at 4:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-02-15 at 15:18 -0500, David Miller wrote:
>
>> All applied, incorporating the suggestions/fixes from Eric. Specifically,
>> using skb_reset_mac_len() in patch #2 and computing pkt_len before ip_local_out()
>> in patch #3.
>
> Thanks David
>
> There is this "tx-nocache-copy" issue :
>
> We currently enable the nocache copy for all devices but loopback.
>
> But it's a performance loss with tunnel devices.
>
> Actually, it seems a loss even for regular ethernet devices :(
>
>
>
> # ethtool -K gre1 tx-nocache-copy on
> # perf stat netperf -H 7.7.8.84
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
>  87380  16384  16384    10.00    4252.42
>
> Performance counter stats for 'netperf -H 7.7.8.84':
>
> 9967.965824 task-clock # 0.996 CPUs utilized
> 54 context-switches # 0.005 K/sec
> 3 CPU-migrations # 0.000 K/sec
> 261 page-faults # 0.026 K/sec
> 27,964,187,393 cycles # 2.805 GHz
> 20,902,040,632 stalled-cycles-frontend # 74.75% frontend cycles idle
> 13,524,565,776 stalled-cycles-backend # 48.36% backend cycles idle
> 15,929,463,578 instructions # 0.57 insns per cycle
> # 1.31 stalled cycles per insn
> 2,065,830,063 branches # 207.247 M/sec
> 35,891,035 branch-misses # 1.74% of all branches
>
> 10.003882959 seconds time elapsed
>
>
> Now we use regular memory copy :
>
> # ethtool -K gre1 tx-nocache-copy off
> # perf stat netperf -H 7.7.8.84
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
>  87380  16384  16384    10.00    7706.50
>
> Performance counter stats for 'netperf -H 7.7.8.84':
>
> 5708.284991 task-clock # 0.571 CPUs utilized
> 5,138 context-switches # 0.900 K/sec
> 24 CPU-migrations # 0.004 K/sec
> 260 page-faults # 0.046 K/sec
> 15,990,404,388 cycles # 2.801 GHz
> 10,903,764,099 stalled-cycles-frontend # 68.19% frontend cycles idle
> 6,089,332,139 stalled-cycles-backend # 38.08% backend cycles idle
> 10,680,845,426 instructions # 0.67 insns per cycle
> # 1.02 stalled cycles per insn
> 1,401,663,288 branches # 245.549 M/sec
> 15,380,428 branch-misses # 1.10% of all branches
>
> 10.004025020 seconds time elapsed
>
>
I am not seeing such a big difference with these settings. Are you
running this test on special hardware or in a VM?
Thanks,
Pravin.
* Re: [PATCH net-next 0/3] v3 GRE: TCP segmentation offload
2013-02-16 1:41 ` Pravin Shelar
@ 2013-02-16 1:43 ` Eric Dumazet
0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2013-02-16 1:43 UTC (permalink / raw)
To: Pravin Shelar; +Cc: David Miller, netdev, edumazet, jesse, bhutchings, mirqus
On Fri, 2013-02-15 at 17:41 -0800, Pravin Shelar wrote:
>
> I am not seeing such a big difference with these settings. Are you
> running this test on special hardware or in a VM?
That's bare metal, actually...
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
stepping : 2
microcode : 0x13
cpu MHz : 2800.330
cache size : 12288 KB
...
* Re: [PATCH net-next 0/3] v3 GRE: TCP segmentation offload
2013-02-16 0:52 ` Eric Dumazet
2013-02-16 1:41 ` Pravin Shelar
@ 2013-02-16 1:53 ` David Miller
1 sibling, 0 replies; 6+ messages in thread
From: David Miller @ 2013-02-16 1:53 UTC (permalink / raw)
To: eric.dumazet; +Cc: pshelar, netdev, edumazet, jesse, bhutchings, mirqus
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 15 Feb 2013 16:52:35 -0800
> There is this "tx-nocache-copy" issue :
That scheme has so many system and device dependencies, but when
it does help it's nice to have.
Unfortunately I don't know the best way to proceed about that.