* unexpected GRO/veth behavior
@ 2018-09-10 14:44 Paolo Abeni
  2018-09-10 14:56 ` Eric Dumazet
  2018-09-11  6:54 ` Paolo Abeni
  0 siblings, 2 replies; 11+ messages in thread
From: Paolo Abeni @ 2018-09-10 14:44 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, Toshiaki Makita

hi all,

while testing some local patches I observed that the TCP tput in the
following scenario:

# the following enables napi on veth0, so that we can trigger the
# GRO path with namespaces
ip netns add test
ip link add type veth
ip link set dev veth0 netns test
ip -n test link set lo up
ip -n test link set veth0 up
ip -n test addr add dev veth0 172.16.1.2/24
ip link set dev veth1 up
ip addr add dev veth1 172.16.1.1/24
IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`

# 'xdp_pass' is a NO-OP XDP program that simply returns XDP_PASS
ip netns exec test ./xdp_pass $IDX &
taskset 0x2 ip netns exec test iperf3 -s -i 60 &
taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60

is much lower than expected (~800Mbps). 'perf' shows a weird topmost
offender:

  80.42%  [kernel]           [k] find_bug

I *think* skb_gro_receive() does not behave correctly if skb->sk is set,
and I experimented with the following patch, so far with success (in the
above scenario tput is now ~11Gbps). Am I missing something? Is that
overkill?

Thank you,

Paolo

---
diff --git a/net/core/dev.c b/net/core/dev.c
index ca78dc5..94723b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5554,6 +5554,11 @@ struct packet_offload *gro_find_complete_by_type(__be16 type)
 
 static void napi_skb_free_stolen_head(struct sk_buff *skb)
 {
+	if (skb->destructor) {
+		WARN_ON(in_irq());
+		skb->destructor(skb);
+	}
+
 	skb_dst_drop(skb);
 	secpath_reset(skb);
 	kmem_cache_free(skbuff_head_cache, skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c996c09..19f2fd9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3827,6 +3827,7 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	unsigned int offset = skb_gro_offset(skb);
 	unsigned int headlen = skb_headlen(skb);
 	unsigned int len = skb_gro_len(skb);
+	unsigned int new_skb_truesize;
 	unsigned int delta_truesize;
 	struct sk_buff *lp;
 
@@ -3858,11 +3859,13 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 		frag->page_offset += offset;
 		skb_frag_size_sub(frag, offset);
 
-		/* all fragments truesize : remove (head size + sk_buff) */
-		delta_truesize = skb->truesize -
-				 SKB_TRUESIZE(skb_end_offset(skb));
+		/* all fragments truesize : remove (head size + sk_buff);
+		 * keep unchanged the amount of memory accounted to skb->sk
+		 */
+		new_skb_truesize = SKB_TRUESIZE(skb_end_offset(skb));
+		delta_truesize = skb->truesize - new_skb_truesize;
 
-		skb->truesize -= skb->data_len;
+		skb->truesize -= new_skb_truesize;
 		skb->len -= skb->data_len;
 		skb->data_len = 0;
 
@@ -3891,12 +3894,24 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
 		/* We dont need to clear skbinfo->nr_frags here */
 
-		delta_truesize = skb->truesize - SKB_DATA_ALIGN(sizeof(struct sk_buff));
+		/* keep unchanged the amount of memory accounted to skb->sk */
+		new_skb_truesize = SKB_DATA_ALIGN(sizeof(struct sk_buff));
+		delta_truesize = skb->truesize - new_skb_truesize;
+		skb->truesize = new_skb_truesize;
 		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
 		goto done;
 	}
 
 merge:
+	if (skb->destructor) {
+		/* skb's truesize ownership is transferred to p, avoid
+		 * releasing it twice when p is freed
+		 */
+		WARN_ON_ONCE(p->sk != skb->sk || p->destructor != skb->destructor);
+		skb->sk = 0;
+		skb->destructor = NULL;
+	}
+
 	delta_truesize = skb->truesize;
 	if (offset > headlen) {
 		unsigned int eat = offset - headlen;


* Re: unexpected GRO/veth behavior
  2018-09-10 14:44 unexpected GRO/veth behavior Paolo Abeni
@ 2018-09-10 14:56 ` Eric Dumazet
  2018-09-10 15:22   ` Paolo Abeni
  2018-09-11  0:33   ` Toshiaki Makita
  2018-09-11  6:54 ` Paolo Abeni
  1 sibling, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2018-09-10 14:56 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: eric.dumazet, Toshiaki Makita



On 09/10/2018 07:44 AM, Paolo Abeni wrote:
> hi all,
> 
> while testing some local patches I observed that the TCP tput in the
> following scenario:
> 
> # the following enables napi on veth0, so that we can trigger the
> # GRO path with namespaces
> ip netns add test
> ip link add type veth
> ip link set dev veth0 netns test
> ip -n test link set lo up
> ip -n test link set veth0 up
> ip -n test addr add dev veth0 172.16.1.2/24
> ip link set dev veth1 up
> ip addr add dev veth1 172.16.1.1/24
> IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`
> 
> # 'xdp_pass' is a NO-OP XDP program that simply returns XDP_PASS
> ip netns exec test ./xdp_pass $IDX &
> taskset 0x2 ip netns exec test iperf3 -s -i 60 &
> taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60
> 
> is much lower than expected (~800Mbps). 'perf' shows a weird topmost
> offender:
>


But... why would GRO even be needed in this scenario?

GRO is really meant for physical devices, having to mess with skb->sk adds extra cost
in this already heavy cost engine.

Virtual devices should already be fed with TSO packets.


* Re: unexpected GRO/veth behavior
  2018-09-10 14:56 ` Eric Dumazet
@ 2018-09-10 15:22   ` Paolo Abeni
  2018-09-10 17:06     ` Eric Dumazet
  2018-09-11  0:33   ` Toshiaki Makita
  1 sibling, 1 reply; 11+ messages in thread
From: Paolo Abeni @ 2018-09-10 15:22 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: Toshiaki Makita

On Mon, 2018-09-10 at 07:56 -0700, Eric Dumazet wrote:
> 
> On 09/10/2018 07:44 AM, Paolo Abeni wrote:
> > hi all,
> > 
> > while testing some local patches I observed that the TCP tput in the
> > following scenario:
> > 
> > # the following enables napi on veth0, so that we can trigger the
> > # GRO path with namespaces
> > ip netns add test
> > ip link add type veth
> > ip link set dev veth0 netns test
> > ip -n test link set lo up
> > ip -n test link set veth0 up
> > ip -n test addr add dev veth0 172.16.1.2/24
> > ip link set dev veth1 up
> > ip addr add dev veth1 172.16.1.1/24
> > IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`
> > 
> > # 'xdp_pass' is a NO-OP XDP program that simply returns XDP_PASS
> > ip netns exec test ./xdp_pass $IDX &
> > taskset 0x2 ip netns exec test iperf3 -s -i 60 &
> > taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60
> > 
> > is much lower than expected (~800Mbps). 'perf' shows a weird topmost
> > offender:
> > 
> 
> 
> But... why would GRO even be needed in this scenario?

AFAICS, attaching an XDP program to a veth device makes TCP flows over
that veth unconditionally hit this code path, since:

commit 948d4f214fde43743c57aae0c708bff44f6345f2
Author: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Date:   Fri Aug 3 16:58:10 2018 +0900

    veth: Add driver XDP

I'm personally looking for some way to hit the GRO code path with
selftest/namespaces.
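
For reference, the receive path that commit introduces looks roughly like
this (a simplified sketch, not the literal driver code): once an XDP
program is attached, the NAPI poll handler drains the per-queue xdp ring
and feeds every resulting skb to GRO, including locally generated skbs
that still carry skb->sk:

/* simplified sketch of drivers/net/veth.c after 948d4f214fde */
static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
{
        int done = 0;

        while (done < budget) {
                void *ptr = __ptr_ring_consume(&rq->xdp_ring);
                struct sk_buff *skb;

                if (!ptr)
                        break;

                skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit); /* runs the XDP prog */
                if (skb)
                        napi_gro_receive(&rq->xdp_napi, skb); /* GRO for every skb */
                done++;
        }
        return done;
}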

> GRO is really meant for physical devices, having to mess with skb->sk adds extra cost
> in this already heavy cost engine.

Yup, though I do not see any measurable cost added by the posted code.

Cheers,

Paolo


* Re: unexpected GRO/veth behavior
  2018-09-10 15:22   ` Paolo Abeni
@ 2018-09-10 17:06     ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2018-09-10 17:06 UTC (permalink / raw)
  To: Paolo Abeni, Eric Dumazet, netdev; +Cc: Toshiaki Makita



On 09/10/2018 08:22 AM, Paolo Abeni wrote:
>> in this already heavy cost engine.
> 
> Yup, though I do not see any measurable cost added by the posted code.

Sure, micro benchmarks won't show anything.

Now, if GRO receives one packet every 100 usec, as many hosts in the wild do,
there is an additional cost because of icache being wasted.


* Re: unexpected GRO/veth behavior
  2018-09-10 14:56 ` Eric Dumazet
  2018-09-10 15:22   ` Paolo Abeni
@ 2018-09-11  0:33   ` Toshiaki Makita
  1 sibling, 0 replies; 11+ messages in thread
From: Toshiaki Makita @ 2018-09-11  0:33 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, netdev

On 2018/09/10 23:56, Eric Dumazet wrote:
> On 09/10/2018 07:44 AM, Paolo Abeni wrote:
>> hi all,
>>
>> while testing some local patches I observed that the TCP tput in the
>> following scenario:
>>
>> # the following enables napi on veth0, so that we can trigger the
>> # GRO path with namespaces
>> ip netns add test
>> ip link add type veth
>> ip link set dev veth0 netns test
>> ip -n test link set lo up
>> ip -n test link set veth0 up
>> ip -n test addr add dev veth0 172.16.1.2/24
>> ip link set dev veth1 up
>> ip addr add dev veth1 172.16.1.1/24
>> IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`
>>
>> # 'xdp_pass' is a NO-OP XDP program that simply returns XDP_PASS
>> ip netns exec test ./xdp_pass $IDX &
>> taskset 0x2 ip netns exec test iperf3 -s -i 60 &
>> taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60
>>
>> is much lower than expected (~800Mbps). 'perf' shows a weird topmost
>> offender:
>>
> 
> 
> But... why would GRO even be needed in this scenario?
> 
> GRO is really meant for physical devices, having to mess with skb->sk adds extra cost
> in this already heavy cost engine.
> 
> Virtual devices should already be fed with TSO packets.

Because XDP does not have an SG feature (the GRO path in veth is used only
when XDP is enabled).

I have tested configuration like this:

NIC ---(XDP_REDIRECT)---> veth===veth (XDP_PASS)

GRO seems to work and improves TCP throughput in this case.


Now I noticed I did not test:

netperf -> veth===veth (XDP_PASS) -> netserver

which I think is the case where Paolo faces a problem.

I don't think that is a case where XDP can improve performance. I think I
can disable GRO for packets with skb->sk != NULL in veth.

-- 
Toshiaki Makita


* Re: unexpected GRO/veth behavior
  2018-09-10 14:44 unexpected GRO/veth behavior Paolo Abeni
  2018-09-10 14:56 ` Eric Dumazet
@ 2018-09-11  6:54 ` Paolo Abeni
  2018-09-11 10:27   ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Paolo Abeni @ 2018-09-11  6:54 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, Toshiaki Makita

Hi,

On Mon, 2018-09-10 at 16:44 +0200, Paolo Abeni wrote:
> while testing some local patches I observed that the TCP tput in the
> following scenario:
> 
> # the following enables napi on veth0, so that we can trigger the
> # GRO path with namespaces
> ip netns add test
> ip link add type veth
> ip link set dev veth0 netns test
> ip -n test link set lo up
> ip -n test link set veth0 up
> ip -n test addr add dev veth0 172.16.1.2/24
> ip link set dev veth1 up
> ip addr add dev veth1 172.16.1.1/24
> IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`
> 
> # 'xdp_pass' is a NO-OP XDP program that simply returns XDP_PASS
> ip netns exec test ./xdp_pass $IDX &
> taskset 0x2 ip netns exec test iperf3 -s -i 60 &
> taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60

In the same scenario, using instead:

iperf3 -c 172.16.1.2 -t 600  -i 60 -N -l 10K

I hit the following splat, on a recent, unpatched net-next:

[  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in iperf3[1644], uid/euid: 0/0
[  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 refcount_error_report+0xa0/0xa4
[  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ #2025
[  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
[  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
[  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
[  362.219711] RSP: 0018:ffff9ee6ff603c20 EFLAGS: 00010282
[  362.225538] RAX: 0000000000000000 RBX: ffffffff9de83e10 RCX: 0000000000000000
[  362.233497] RDX: 0000000000000001 RSI: ffff9ee6ff6167d8 RDI: ffff9ee6ff6167d8
[  362.241457] RBP: ffff9ee6ff603d78 R08: 0000000000000490 R09: 0000000000000004
[  362.249416] R10: 0000000000000000 R11: ffff9ee6ff603990 R12: ffff9ee664b94500
[  362.257377] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff9de615f9
[  362.265337] FS:  00007f1d22d28740(0000) GS:ffff9ee6ff600000(0000) knlGS:0000000000000000
[  362.274363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  362.280773] CR2: 00007f1d222f35d0 CR3: 0000001fddfec003 CR4: 00000000001606f0
[  362.288733] Call Trace:
[  362.291459]  <IRQ>
[  362.293702]  ex_handler_refcount+0x4e/0x80
[  362.298269]  fixup_exception+0x35/0x40
[  362.302451]  do_trap+0x109/0x150
[  362.306048]  do_error_trap+0xd5/0x130
[  362.315766]  invalid_op+0x14/0x20
[  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
[  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
[  362.345465] RSP: 0018:ffff9ee6ff603e20 EFLAGS: 00010a86
[  362.351291] RAX: 0000000000001100 RBX: ffff9ee65deec700 RCX: ffff9ee65e829244
[  362.359250] RDX: 0000000000000100 RSI: ffff9ee65e829100 RDI: ffff9ee65deec700
[  362.367210] RBP: ffff9ee65e829100 R08: 000000000002a380 R09: 0000000000000000
[  362.375169] R10: 0000000000000002 R11: fffff1a4bf77bb00 R12: ffffc0754661d000
[  362.383130] R13: ffff9ee65deec200 R14: ffff9ee65f597000 R15: 00000000000000aa
[  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
[  362.399357]  veth_poll+0x4d/0x17a [veth]
[  362.403731]  net_rx_action+0x2af/0x3f0
[  362.407912]  __do_softirq+0xdd/0x29e
[  362.411897]  do_softirq_own_stack+0x2a/0x40
[  362.416561]  </IRQ>
[  362.418899]  do_softirq+0x4b/0x70
[  362.422594]  __local_bh_enable_ip+0x50/0x60
[  362.427258]  ip_finish_output2+0x16a/0x390
[  362.431824]  ip_output+0x71/0xe0
[  362.440670]  __tcp_transmit_skb+0x583/0xab0
[  362.445333]  tcp_write_xmit+0x247/0xfb0
[  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
[  362.454760]  tcp_sendmsg_locked+0x857/0xd30
[  362.459424]  tcp_sendmsg+0x27/0x40
[  362.463216]  sock_sendmsg+0x36/0x50
[  362.467104]  sock_write_iter+0x87/0x100
[  362.471382]  __vfs_write+0x112/0x1a0
[  362.475369]  vfs_write+0xad/0x1a0
[  362.479062]  ksys_write+0x52/0xc0
[  362.482759]  do_syscall_64+0x5b/0x180
[  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  362.492473] RIP: 0033:0x7f1d22293238
[  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  362.517409] RSP: 002b:00007ffebaef8008 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  362.525855] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007f1d22293238
[  362.533816] RDX: 0000000000002800 RSI: 00007f1d22d36000 RDI: 0000000000000005
[  362.541775] RBP: 00007f1d22d36000 R08: 00000002db777a30 R09: 0000562b70712b20
[  362.549734] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
[  362.557693] R13: 0000000000002800 R14: 00007ffebaef8060 R15: 0000562b70712260

The problem, AFAICS, is that the GRO path changes the cumulative
truesize of the skbs entering that code path without updating
sk_wmem_alloc. The posted code tries to keep that cumulative truesize
unchanged.
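
To make the accounting explicit, this is roughly how the sender side
charges and releases the memory (a simplified sketch, not the exact
net/core/sock.c code): skb_set_owner_w() adds skb->truesize to
sk_wmem_alloc and the destructor later subtracts the same field, so any
change to truesize while the skb is in flight corrupts the counter:

/* simplified sketch of the sender-side accounting */
void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
        skb->sk = sk;
        skb->destructor = sock_wfree;
        refcount_add(skb->truesize, &sk->sk_wmem_alloc);
}

void sock_wfree(struct sk_buff *skb)
{
        /* releases whatever skb->truesize holds *now*; if GRO changed it
         * after skb_set_owner_w() ran, sk_wmem_alloc drifts and eventually
         * trips the refcount checks, as in the splat above
         */
        if (refcount_sub_and_test(skb->truesize, &skb->sk->sk_wmem_alloc))
                __sk_free(skb->sk);
}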

I *think* we can hit a similar condition with a tun device in IFF_NAPI
mode.

Cheers,

Paolo


* Re: unexpected GRO/veth behavior
  2018-09-11  6:54 ` Paolo Abeni
@ 2018-09-11 10:27   ` Eric Dumazet
  2018-09-11 11:07     ` Toshiaki Makita
  2018-09-11 11:44     ` Paolo Abeni
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2018-09-11 10:27 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: eric.dumazet, Toshiaki Makita



On 09/10/2018 11:54 PM, Paolo Abeni wrote:
> Hi,
> 
> On Mon, 2018-09-10 at 16:44 +0200, Paolo Abeni wrote:
>> while testing some local patches I observed that the TCP tput in the
>> following scenario:
>>
>> # the following enables napi on veth0, so that we can trigger the
>> # GRO path with namespaces
>> ip netns add test
>> ip link add type veth
>> ip link set dev veth0 netns test
>> ip -n test link set lo up
>> ip -n test link set veth0 up
>> ip -n test addr add dev veth0 172.16.1.2/24
>> ip link set dev veth1 up
>> ip addr add dev veth1 172.16.1.1/24
>> IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`
>>
>> # 'xdp_pass' is a NO-OP XDP program that simply returns XDP_PASS
>> ip netns exec test ./xdp_pass $IDX &
>> taskset 0x2 ip netns exec test iperf3 -s -i 60 &
>> taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60
> 
> In the same scenario, using instead:
> 
> iperf3 -c 172.16.1.2 -t 600  -i 60 -N -l 10K
> 
> I hit the following splat, on a recent, unpatched net-next:
> 
> [  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in iperf3[1644], uid/euid: 0/0
> [  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 refcount_error_report+0xa0/0xa4
> [  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
> [  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ #2025
> [  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
> [  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
> [  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
> [  362.219711] RSP: 0018:ffff9ee6ff603c20 EFLAGS: 00010282
> [  362.225538] RAX: 0000000000000000 RBX: ffffffff9de83e10 RCX: 0000000000000000
> [  362.233497] RDX: 0000000000000001 RSI: ffff9ee6ff6167d8 RDI: ffff9ee6ff6167d8
> [  362.241457] RBP: ffff9ee6ff603d78 R08: 0000000000000490 R09: 0000000000000004
> [  362.249416] R10: 0000000000000000 R11: ffff9ee6ff603990 R12: ffff9ee664b94500
> [  362.257377] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff9de615f9
> [  362.265337] FS:  00007f1d22d28740(0000) GS:ffff9ee6ff600000(0000) knlGS:0000000000000000
> [  362.274363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  362.280773] CR2: 00007f1d222f35d0 CR3: 0000001fddfec003 CR4: 00000000001606f0
> [  362.288733] Call Trace:
> [  362.291459]  <IRQ>
> [  362.293702]  ex_handler_refcount+0x4e/0x80
> [  362.298269]  fixup_exception+0x35/0x40
> [  362.302451]  do_trap+0x109/0x150
> [  362.306048]  do_error_trap+0xd5/0x130
> [  362.315766]  invalid_op+0x14/0x20
> [  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
> [  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
> [  362.345465] RSP: 0018:ffff9ee6ff603e20 EFLAGS: 00010a86
> [  362.351291] RAX: 0000000000001100 RBX: ffff9ee65deec700 RCX: ffff9ee65e829244
> [  362.359250] RDX: 0000000000000100 RSI: ffff9ee65e829100 RDI: ffff9ee65deec700
> [  362.367210] RBP: ffff9ee65e829100 R08: 000000000002a380 R09: 0000000000000000
> [  362.375169] R10: 0000000000000002 R11: fffff1a4bf77bb00 R12: ffffc0754661d000
> [  362.383130] R13: ffff9ee65deec200 R14: ffff9ee65f597000 R15: 00000000000000aa
> [  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
> [  362.399357]  veth_poll+0x4d/0x17a [veth]
> [  362.403731]  net_rx_action+0x2af/0x3f0
> [  362.407912]  __do_softirq+0xdd/0x29e
> [  362.411897]  do_softirq_own_stack+0x2a/0x40
> [  362.416561]  </IRQ>
> [  362.418899]  do_softirq+0x4b/0x70
> [  362.422594]  __local_bh_enable_ip+0x50/0x60
> [  362.427258]  ip_finish_output2+0x16a/0x390
> [  362.431824]  ip_output+0x71/0xe0
> [  362.440670]  __tcp_transmit_skb+0x583/0xab0
> [  362.445333]  tcp_write_xmit+0x247/0xfb0
> [  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
> [  362.454760]  tcp_sendmsg_locked+0x857/0xd30
> [  362.459424]  tcp_sendmsg+0x27/0x40
> [  362.463216]  sock_sendmsg+0x36/0x50
> [  362.467104]  sock_write_iter+0x87/0x100
> [  362.471382]  __vfs_write+0x112/0x1a0
> [  362.475369]  vfs_write+0xad/0x1a0
> [  362.479062]  ksys_write+0x52/0xc0
> [  362.482759]  do_syscall_64+0x5b/0x180
> [  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  362.492473] RIP: 0033:0x7f1d22293238
> [  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> [  362.517409] RSP: 002b:00007ffebaef8008 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  362.525855] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007f1d22293238
> [  362.533816] RDX: 0000000000002800 RSI: 00007f1d22d36000 RDI: 0000000000000005
> [  362.541775] RBP: 00007f1d22d36000 R08: 00000002db777a30 R09: 0000562b70712b20
> [  362.549734] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> [  362.557693] R13: 0000000000002800 R14: 00007ffebaef8060 R15: 0000562b70712260
> 
> The problem, AFAICS, is that the GRO path changes the cumulative
> truesize of the skbs entering that code path without updating
> sk_wmem_alloc. The posted code tries to keep that cumulative truesize
> unchanged.
> 

As I said, skbs entering GRO should not have skb->sk set.
GRO fully owns skbs.

No need to convince us.

For some reason, Toshiaki Makita added XDP and GRO, and broke veth.



> I *think* we can hit a similar condition with a tun device in IFF_NAPI
> mode.

Why ?

tun_get_user() does not attach skb to a socket, that would be quite useless since skb
is entering input path and would be orphaned right away.


Fix would probably be :

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 8d679c8b7f25c753d77cfb8821d9d2528c9c9048..96bd94480942b469403abf017f9f9d5be1e23ef5 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -602,9 +602,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
                        skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
                }
 
-               if (skb)
+               if (skb) {
+                       skb_orphan(skb);
                        napi_gro_receive(&rq->xdp_napi, skb);
-
+               }
                done++;
        }
 


* Re: unexpected GRO/veth behavior
  2018-09-11 10:27   ` Eric Dumazet
@ 2018-09-11 11:07     ` Toshiaki Makita
  2018-09-14  2:51       ` Toshiaki Makita
  2018-09-11 11:44     ` Paolo Abeni
  1 sibling, 1 reply; 11+ messages in thread
From: Toshiaki Makita @ 2018-09-11 11:07 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, netdev

On 2018/09/11 19:27, Eric Dumazet wrote:
...
> Fix would probably be :
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 8d679c8b7f25c753d77cfb8821d9d2528c9c9048..96bd94480942b469403abf017f9f9d5be1e23ef5 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -602,9 +602,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
>                         skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
>                 }
>  
> -               if (skb)
> +               if (skb) {
> +                       skb_orphan(skb);
>                         napi_gro_receive(&rq->xdp_napi, skb);
> -
> +               }
>                 done++;
>         }

Considering commit 9c4c3252 ("skbuff: preserve sock reference when
scrubbing the skb.") I'm not sure if we should unconditionally orphan
the skb here.
I was thinking I should call netif_receive_skb() for such packets
instead of napi_gro_receive().
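
Just to illustrate that alternative (a hypothetical sketch, not a tested
patch), the receive loop would bypass GRO only for skbs that still carry a
socket reference:

                if (skb) {
                        if (skb->sk)
                                /* keep the sk / truesize accounting intact */
                                netif_receive_skb(skb);
                        else
                                napi_gro_receive(&rq->xdp_napi, skb);
                }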

-- 
Toshiaki Makita


* Re: unexpected GRO/veth behavior
  2018-09-11 10:27   ` Eric Dumazet
  2018-09-11 11:07     ` Toshiaki Makita
@ 2018-09-11 11:44     ` Paolo Abeni
  2018-09-11 15:42       ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Paolo Abeni @ 2018-09-11 11:44 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: Toshiaki Makita

On Tue, 2018-09-11 at 03:27 -0700, Eric Dumazet wrote:
> On 09/10/2018 11:54 PM, Paolo Abeni wrote:
> > I *think* we can hit a similar condition with a tun device in IFF_NAPI
> > mode.
> 
> Why ?
> 
> tun_get_user() does not attach skb to a socket, that would be quite useless since skb
> 
	
I think we can hit this code path:

tun_get_user() -> tun_alloc_skb() -> sock_alloc_send_pskb() ->
skb_set_owner_w() 

Then later napi and GRO.

Cheers,

Paolo


* Re: unexpected GRO/veth behavior
  2018-09-11 11:44     ` Paolo Abeni
@ 2018-09-11 15:42       ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2018-09-11 15:42 UTC (permalink / raw)
  To: Paolo Abeni, Eric Dumazet, netdev; +Cc: Toshiaki Makita



On 09/11/2018 04:44 AM, Paolo Abeni wrote:
> On Tue, 2018-09-11 at 03:27 -0700, Eric Dumazet wrote:
>> On 09/10/2018 11:54 PM, Paolo Abeni wrote:
>>> I *think* we can hit a similar condition with a tun device in IFF_NAPI
>>> mode.
>>
>> Why ?
>>
>> tun_get_user() does not attach skb to a socket, that would be quite useless since skb
>>
> 	
> I think we can hit this code path:
> 
> tun_get_user() -> tun_alloc_skb() -> sock_alloc_send_pskb() ->
> skb_set_owner_w() 
> 
> Then later napi and GRO.
> 


No we can not.

Please look again.


* Re: unexpected GRO/veth behavior
  2018-09-11 11:07     ` Toshiaki Makita
@ 2018-09-14  2:51       ` Toshiaki Makita
  0 siblings, 0 replies; 11+ messages in thread
From: Toshiaki Makita @ 2018-09-14  2:51 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, netdev

On 2018/09/11 20:07, Toshiaki Makita wrote:
> On 2018/09/11 19:27, Eric Dumazet wrote:
> ...
>> Fix would probably be :
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 8d679c8b7f25c753d77cfb8821d9d2528c9c9048..96bd94480942b469403abf017f9f9d5be1e23ef5 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -602,9 +602,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
>>                         skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
>>                 }
>>  
>> -               if (skb)
>> +               if (skb) {
>> +                       skb_orphan(skb);
>>                         napi_gro_receive(&rq->xdp_napi, skb);
>> -
>> +               }
>>                 done++;
>>         }
> 
> Considering commit 9c4c3252 ("skbuff: preserve sock reference when
> scrubbing the skb.") I'm not sure if we should unconditionally orphan
> the skb here.
> I was thinking I should call netif_receive_skb() for such packets
> instead of napi_gro_receive().

I tested TCP throughput within localhost with XDP enabled (with
skb_orphan() fix).

GRO off: 4.7 Gbps
GRO on : 6.7 Gbps

Since there is a not-so-small difference, I'm making a patch which orphans
the skb as Eric suggested (but in veth_xdp_rcv_skb() instead).
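
A guess at the shape of that change, since the actual patch is not in this
thread: orphan the skb at the top of veth_xdp_rcv_skb(), so the sender-side
accounting is settled before XDP/GRO can alter skb->truesize:

static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
                                        unsigned int *xdp_xmit)
{
        /* hypothetical placement, not the submitted patch */
        skb_orphan(skb);

        /* ... existing XDP processing of the skb ... */
        return skb;
}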

Thanks!

-- 
Toshiaki Makita

