From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: unexpected GRO/veth behavior
Date: Tue, 11 Sep 2018 03:27:59 -0700
Message-ID:
References: <4106d3f7eee7f0186fcfdd0331cdafeecd3240c0.camel@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: eric.dumazet@gmail.com, Toshiaki Makita
To: Paolo Abeni, netdev@vger.kernel.org
Return-path:
Received: from mail-wr1-f66.google.com ([209.85.221.66]:36672 "EHLO mail-wr1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726613AbeIKP0q (ORCPT ); Tue, 11 Sep 2018 11:26:46 -0400
Received: by mail-wr1-f66.google.com with SMTP id e1-v6so16333840wrt.3 for ; Tue, 11 Sep 2018 03:28:03 -0700 (PDT)
In-Reply-To:
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID:

On 09/10/2018 11:54 PM, Paolo Abeni wrote:
> Hi,
>
> On Mon, 2018-09-10 at 16:44 +0200, Paolo Abeni wrote:
>> while testing some local patches I observed that the TCP tput in the
>> following scenario:
>>
>> # the following enable napi on veth0, so that we can trigger the
>> # GRO path with namespaces
>> ip netns add test
>> ip link add type veth
>> ip link set dev veth0 netns test
>> ip -n test link set lo up
>> ip -n test link set veth0 up
>> ip -n test addr add dev veth0 172.16.1.2/24
>> ip link set dev veth1 up
>> ip addr add dev veth1 172.16.1.1/24
>> IDX=`ip netns exec test cat /sys/class/net/veth0/ifindex`
>>
>> # 'xdp_pass' is a NO-OP XDP program that simply return XDP_PASS
>> ip netns exec test ./xdp_pass $IDX &
>> taskset 0x2 ip netns exec test iperf3 -s -i 60 &
>> taskset 0x1 iperf3 -c 172.16.1.2 -t 60 -i 60
>
> In the same scenario, using instead:
>
> iperf3 -c 172.16.1.2 -t 600 -i 60 -N -l 10K
>
> I hit the following splat, on a recent, unpatched net-next:
>
> [ 362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in iperf3[1644], uid/euid: 0/0
> [ 362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 refcount_error_report+0xa0/0xa4
> [ 362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
> [ 362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ #2025
> [ 362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
> [ 362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
> [ 362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
> [ 362.219711] RSP: 0018:ffff9ee6ff603c20 EFLAGS: 00010282
> [ 362.225538] RAX: 0000000000000000 RBX: ffffffff9de83e10 RCX: 0000000000000000
> [ 362.233497] RDX: 0000000000000001 RSI: ffff9ee6ff6167d8 RDI: ffff9ee6ff6167d8
> [ 362.241457] RBP: ffff9ee6ff603d78 R08: 0000000000000490 R09: 0000000000000004
> [ 362.249416] R10: 0000000000000000 R11: ffff9ee6ff603990 R12: ffff9ee664b94500
> [ 362.257377] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff9de615f9
> [ 362.265337] FS: 00007f1d22d28740(0000) GS:ffff9ee6ff600000(0000) knlGS:0000000000000000
> [ 362.274363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 362.280773] CR2: 00007f1d222f35d0 CR3: 0000001fddfec003 CR4: 00000000001606f0
> [ 362.288733] Call Trace:
> [ 362.291459]
> [ 362.293702] ex_handler_refcount+0x4e/0x80
> [ 362.298269] fixup_exception+0x35/0x40
> [ 362.302451] do_trap+0x109/0x150
> [ 362.306048] do_error_trap+0xd5/0x130
> [ 362.315766] invalid_op+0x14/0x20
> [ 362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
> [ 362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
> [ 362.345465] RSP: 0018:ffff9ee6ff603e20 EFLAGS: 00010a86
> [ 362.351291] RAX: 0000000000001100 RBX: ffff9ee65deec700 RCX: ffff9ee65e829244
> [ 362.359250] RDX: 0000000000000100 RSI: ffff9ee65e829100 RDI: ffff9ee65deec700
> [ 362.367210] RBP: ffff9ee65e829100 R08: 000000000002a380 R09: 0000000000000000
> [ 362.375169] R10: 0000000000000002 R11: fffff1a4bf77bb00 R12: ffffc0754661d000
> [ 362.383130] R13: ffff9ee65deec200 R14: ffff9ee65f597000 R15: 00000000000000aa
> [ 362.391092] veth_xdp_rcv+0x4e4/0x890 [veth]
> [ 362.399357] veth_poll+0x4d/0x17a [veth]
> [ 362.403731] net_rx_action+0x2af/0x3f0
> [ 362.407912] __do_softirq+0xdd/0x29e
> [ 362.411897] do_softirq_own_stack+0x2a/0x40
> [ 362.416561]
> [ 362.418899] do_softirq+0x4b/0x70
> [ 362.422594] __local_bh_enable_ip+0x50/0x60
> [ 362.427258] ip_finish_output2+0x16a/0x390
> [ 362.431824] ip_output+0x71/0xe0
> [ 362.440670] __tcp_transmit_skb+0x583/0xab0
> [ 362.445333] tcp_write_xmit+0x247/0xfb0
> [ 362.449609] __tcp_push_pending_frames+0x2d/0xd0
> [ 362.454760] tcp_sendmsg_locked+0x857/0xd30
> [ 362.459424] tcp_sendmsg+0x27/0x40
> [ 362.463216] sock_sendmsg+0x36/0x50
> [ 362.467104] sock_write_iter+0x87/0x100
> [ 362.471382] __vfs_write+0x112/0x1a0
> [ 362.475369] vfs_write+0xad/0x1a0
> [ 362.479062] ksys_write+0x52/0xc0
> [ 362.482759] do_syscall_64+0x5b/0x180
> [ 362.486841] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 362.492473] RIP: 0033:0x7f1d22293238
> [ 362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> [ 362.517409] RSP: 002b:00007ffebaef8008 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 362.525855] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007f1d22293238
> [ 362.533816] RDX: 0000000000002800 RSI: 00007f1d22d36000 RDI: 0000000000000005
> [ 362.541775] RBP: 00007f1d22d36000 R08: 00000002db777a30 R09: 0000562b70712b20
> [ 362.549734] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> [ 362.557693] R13: 0000000000002800 R14: 00007ffebaef8060 R15: 0000562b70712260
>
> The problem, AFAICS, is that the GRO path changes the cumulative
> truesize of the skbs entering such code path without updating
> sk_wmem_alloc. The posted code tries to keep unchanged such cumulative
> truesize.
>

As I said, skbs entering GRO should not have skb->sk set.

GRO fully owns skbs. No need to convince us.

For some reason, Toshiaki Makita added XDP and GRO, and broke veth

> I *think* we can hit a similar condition with a tun device in IFF_NAPI
> mode.

Why ?

tun_get_user() does not attach skb to a socket, that would be quite useless
since skb is entering input path and would be orphaned right away.

Fix would probably be :

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 8d679c8b7f25c753d77cfb8821d9d2528c9c9048..96bd94480942b469403abf017f9f9d5be1e23ef5 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -602,9 +602,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
 			skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
 		}
 
-		if (skb)
+		if (skb) {
+			skb_orphan(skb);
 			napi_gro_receive(&rq->xdp_napi, skb);
-
+		}
 		done++;
 	}
 
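
To illustrate why orphaning before GRO keeps the socket accounting balanced, here is a minimal userspace sketch. It is a toy model, not kernel code: the struct and function names mirror the kernel's, but the bodies are simplified assumptions, wmem_alloc is a plain long rather than a refcount_t, and toy_gro_merge() and toy_scenario() are hypothetical helpers made up for this example. The direction and size of the drift in a real kernel depend on details this model leaves out.

```c
#include <assert.h>
#include <stddef.h>

struct sock {
	long wmem_alloc;	/* stands in for sk->sk_wmem_alloc */
};

struct sk_buff {
	struct sock *sk;
	unsigned int truesize;
	void (*destructor)(struct sk_buff *skb);
};

/* Uncharge whatever truesize the skb has at the time it is released. */
static void sock_wfree(struct sk_buff *skb)
{
	skb->sk->wmem_alloc -= skb->truesize;
}

/* Charge the skb to the sending socket at its current truesize. */
static void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
	assert(skb->sk == NULL);
	skb->sk = sk;
	skb->destructor = sock_wfree;
	sk->wmem_alloc += skb->truesize;
}

/* Drop socket ownership; a second call on the same skb is a no-op. */
static void skb_orphan(struct sk_buff *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}

/* Hypothetical stand-in for a GRO merge: the head absorbs the other
 * skb's truesize and no socket counter is touched. */
static void toy_gro_merge(struct sk_buff *head, struct sk_buff *skb)
{
	head->truesize += skb->truesize;
}

/* Returns the socket counter after both skbs are released.
 * A balanced run ends at 0. */
long toy_scenario(int orphan_first)
{
	struct sock sk = { 0 };
	struct sk_buff a = { NULL, 2048, NULL };
	struct sk_buff b = { NULL, 2048, NULL };

	skb_set_owner_w(&a, &sk);
	skb_set_owner_w(&b, &sk);

	if (orphan_first) {	/* what the proposed patch does */
		skb_orphan(&a);
		skb_orphan(&b);
	}

	toy_gro_merge(&a, &b);

	/* Releasing an skb runs its destructor, if it still has one. */
	skb_orphan(&b);
	skb_orphan(&a);

	return sk.wmem_alloc;
}
```

In this model toy_scenario(1) ends with the counter at 0, while toy_scenario(0) ends at -2048, because the head skb is uncharged at a truesize it was never charged for.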